# Navigation

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the first project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893).

### 1. Start the Environment

We begin by importing some necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Banana.app"`
- **Windows** (x86): `"path/to/Banana_Windows_x86/Banana.exe"`
- **Windows** (x86_64): `"path/to/Banana_Windows_x86_64/Banana.exe"`
- **Linux** (x86): `"path/to/Banana_Linux/Banana.x86"`
- **Linux** (x86_64): `"path/to/Banana_Linux/Banana.x86_64"`
- **Linux** (x86, headless): `"path/to/Banana_Linux_NoVis/Banana.x86"`
- **Linux** (x86_64, headless): `"path/to/Banana_Linux_NoVis/Banana.x86_64"`

For instance, if you are using a Mac, then you downloaded `Banana.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Banana.app")
```

In [2]:
# env = UnityEnvironment(file_name="/home/arasdar/VisualBanana_Linux/Banana.x86")
# env = UnityEnvironment(file_name="/home/aras/unity-envs/Banana_Linux/Banana.x86_64")
env = UnityEnvironment(file_name="/home/arasdar/unity-envs/Banana_Linux_NoVis/Banana.x86_64")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: BananaBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 37
        Number of stacked Vector Observation: 1
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

The simulation contains a single agent that navigates a large environment.  At each time step, it has four actions at its disposal:
- `0` - walk forward 
- `1` - walk backward
- `2` - turn left
- `3` - turn right

The state space has `37` dimensions and contains the agent's velocity, along with ray-based perception of objects around agent's forward direction.  A reward of `+1` is provided for collecting a yellow banana, and a reward of `-1` is provided for collecting a blue banana. 

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents in the environment
print('Number of agents:', len(env_info.agents))

# number of actions
action_size = brain.vector_action_space_size
print('Number of actions:', action_size)

# examine the state space 
state = env_info.vector_observations[0]
# print('States look like:', state)
state_size = len(state)
print('States have length:', state_size)
# print(state.shape, len(env_info.vector_observations), env_info.vector_observations.shape)

Number of agents: 1
Number of actions: 4
States have length: 37


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, you will watch the agent's performance, if it selects an action (uniformly) at random with each time step.  A window should pop up that allows you to observe the agent, as it moves through the environment.  

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!

In [5]:
env_info = env.reset(train_mode=False)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
num_steps = 0
while True:
    num_steps += 1
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    state = next_state                             # roll over the state to next time step
    if done:                                       # exit loop if episode finished
        print(state.shape)
        break
    
print("Score: {}".format(score))
num_steps

(37,)
Score: 2.0


300

When finished, you can close the environment.

In [6]:
# env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the environment, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [7]:
env_info = env.reset(train_mode=True)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
num_steps = 0
while True:
    num_steps += 1
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    state = next_state                             # roll over the state to next time step
    #print(state)
    if done:                                       # exit loop if episode finished
        break
    
print("Score: {}".format(score))
num_steps

Score: 2.0


300

In [8]:
# In this one we should define and detect GPUs for tensorflow
# GPUs or CPU
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: /device:GPU:0


In [9]:
env_info = env.reset(train_mode=True)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
batch = []
num_steps = 0
while True: # infinite number of steps
    num_steps += 1
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    #print(state, action, reward, done)
    batch.append([state, action, next_state, reward, float(done)])
    state = next_state                             # roll over the state to next time step
    if done:                                       # exit loop if episode finished
        break
    
print("Score: {}".format(score))
num_steps

Score: -1.0


300

In [10]:
batch[0], batch[0][1]

([array([1.        , 0.        , 0.        , 0.        , 0.20790455,
         0.        , 0.        , 1.        , 0.        , 0.01327663,
         1.        , 0.        , 0.        , 0.        , 0.1713129 ,
         0.        , 0.        , 0.        , 1.        , 0.        ,
         0.        , 0.        , 1.        , 0.        , 0.01027161,
         0.        , 0.        , 1.        , 0.        , 0.02920904,
         0.        , 0.        , 1.        , 0.        , 0.00962341,
         0.        , 0.        ]),
  3,
  array([1.        , 0.        , 0.        , 0.        , 0.16192973,
         0.        , 0.        , 1.        , 0.        , 0.59092385,
         0.        , 0.        , 1.        , 0.        , 0.00976282,
         1.        , 0.        , 0.        , 0.        , 0.39982799,
         0.        , 0.        , 1.        , 0.        , 0.01006025,
         0.        , 0.        , 1.        , 0.        , 0.44728744,
         0.        , 0.        , 1.        , 0.        , 0.0180

In [11]:
batch[0]

[array([1.        , 0.        , 0.        , 0.        , 0.20790455,
        0.        , 0.        , 1.        , 0.        , 0.01327663,
        1.        , 0.        , 0.        , 0.        , 0.1713129 ,
        0.        , 0.        , 0.        , 1.        , 0.        ,
        0.        , 0.        , 1.        , 0.        , 0.01027161,
        0.        , 0.        , 1.        , 0.        , 0.02920904,
        0.        , 0.        , 1.        , 0.        , 0.00962341,
        0.        , 0.        ]),
 3,
 array([1.        , 0.        , 0.        , 0.        , 0.16192973,
        0.        , 0.        , 1.        , 0.        , 0.59092385,
        0.        , 0.        , 1.        , 0.        , 0.00976282,
        1.        , 0.        , 0.        , 0.        , 0.39982799,
        0.        , 0.        , 1.        , 0.        , 0.01006025,
        0.        , 0.        , 1.        , 0.        , 0.44728744,
        0.        , 0.        , 1.        , 0.        , 0.0180034 ,
        0.

In [12]:
states = np.array([each[1] for each in batch])
actions = np.array([each[0] for each in batch])
next_states = np.array([each[1] for each in batch])
rewards = np.array([each[2] for each in batch])
dones = np.array([each[3] for each in batch])
# infos = np.array([each[4] for each in batch])

In [13]:
# print(rewards[:])
print(np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
print(np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
print(np.max(np.array(actions)), np.min(np.array(actions)), 
      (np.max(np.array(actions)) - np.min(np.array(actions)))+1)
print(np.max(np.array(rewards)), np.min(np.array(rewards)))
print(np.max(np.array(states)), np.min(np.array(states)))

(300, 37) (300,) (300, 37) (300,)
float64 int64 float64 float64
9.81694221496582 -10.490506172180176 21.307448387145996
9.81694221496582 -10.490506172180176
3 0


In [14]:
def model_input(state_size, lstm_size, batch_size=1):
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.int32, [None], name='actions')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    # RNN
    gru = tf.nn.rnn_cell.GRUCell(lstm_size)
    cell = tf.nn.rnn_cell.MultiRNNCell([gru], state_is_tuple=False)
    initial_state = cell.zero_state(batch_size, tf.float32)
    return states, actions, targetQs, cell, initial_state

In [15]:
# RNN generator or sequence generator
def generator(states, initial_state, cell, lstm_size, num_classes, reuse=False): 
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        inputs = tf.layers.dense(inputs=states, units=lstm_size)
        print(states.shape, inputs.shape)
        
        # with tf.variable_scope('dynamic_rnn_', reuse=tf.AUTO_REUSE):
        # dynamic means adapt to the batch_size
        inputs_rnn = tf.reshape(inputs, [1, -1, lstm_size]) # NxH -> 1xNxH
        print(inputs_rnn.shape, initial_state.shape)
        outputs_rnn, final_state = tf.nn.dynamic_rnn(cell=cell, inputs=inputs_rnn, initial_state=initial_state)
        print(outputs_rnn.shape, final_state.shape)
        outputs = tf.reshape(outputs_rnn, [-1, lstm_size]) # 1xNxH -> NxH
        print(outputs.shape)

        # Last fully connected layer
        logits = tf.layers.dense(inputs=outputs, units=num_classes)
        print(logits.shape)
        #predictions = tf.nn.softmax(logits)
        
        # logits are the action logits
        return logits, final_state

In [16]:
def model_loss(action_size, hidden_size, states, cell, initial_state, actions, targetQs):
    actions_logits, final_state = generator(states=states, cell=cell, initial_state=initial_state, 
                                            lstm_size=hidden_size, num_classes=action_size)
    actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    Qs = tf.reduce_max(actions_logits*actions_labels, axis=1)
    loss = tf.reduce_mean(tf.square(Qs - targetQs))
    return actions_logits, final_state, loss

In [17]:
def model_opt(loss, learning_rate):
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]

    # # Optimize
    # with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
    # #opt = tf.train.AdamOptimizer(learning_rate).minimize(loss, var_list=g_vars)

    #grads, _ = tf.clip_by_global_norm(t_list=tf.gradients(loss, g_vars), clip_norm=5) # usually around 1-5
    grads = tf.gradients(loss, g_vars)
    opt = tf.train.AdamOptimizer(learning_rate).apply_gradients(grads_and_vars=zip(grads, g_vars))
    return opt

In [18]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs, cell, self.initial_state = model_input(
            state_size=state_size, lstm_size=hidden_size)
        
        # Create the Model: calculating the loss and forwad pass
        self.actions_logits, self.final_state, self.loss = model_loss(
            action_size=action_size, hidden_size=hidden_size, 
            states=self.states, actions=self.actions, 
            targetQs=self.targetQs, cell=cell, initial_state=self.initial_state)

        # Update the model: backward pass and backprop
        self.opt = model_opt(loss=self.loss, learning_rate=learning_rate)

In [19]:
from collections import deque

class Memory():    
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
        self.states = deque(maxlen=max_size)

In [20]:
# Network parameters
action_size = 4
state_size = 37
hidden_size = 37*2             # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 64            # memory capacity
batch_size = 64             # experience mini-batch size
gamma = 0.99                 # future reward discount

In [21]:
# Reset/init the graph/session
graph = tf.reset_default_graph()

# Init the model
model = Model(action_size=action_size, hidden_size=hidden_size, state_size=state_size, learning_rate=learning_rate)

# Init the memory
memory = Memory(max_size=memory_size)

(?, 37) (?, 74)
(1, ?, 74) (1, 74)
(1, ?, 74) (1, 74)
(?, 74)
(?, 4)


In [22]:
# state = env.reset()
# for _ in range(batch_size):
#     action = env.action_space.sample()
#     next_state, reward, done, _ = env.step(action)
#     memory.buffer.append([state, action, next_state, reward, float(done)])
#     state = next_state
#     if done is True:
#         state = env.reset()

In [23]:
env_info = env.reset(train_mode=True)[brain_name] # reset the environment
state = env_info.vector_observations[0]   # get the state
for _ in range(memory_size):
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    memory.buffer.append([state, action, next_state, reward, float(done)])
    memory.states.append(np.zeros([1, hidden_size])) # initial_states for rnn/mem
    state = next_state
    if done:                                       # exit loop if episode finished
        env_info = env.reset(train_mode=True)[brain_name] # reset the environment
        state = env_info.vector_observations[0]   # get the state
        break

In [24]:
# initial_states = memory.states
memory.states[0].shape

(1, 74)

In [None]:
# Save/load the model and save for plotting
saver = tf.train.Saver()
episode_rewards_list, rewards_list, loss_list = [], [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    episode_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
    
    # Training episodes/epochs
    for ep in range(11111):
        total_reward = 0
        loss_batch = []
        #state = env.reset()
        env_info = env.reset(train_mode=True)[brain_name] # reset the environment
        state = env_info.vector_observations[0]   # get the current state
        initial_state = sess.run(model.initial_state)

        # Training steps/batches
        for num_steps in range(11111111111):
            action_logits, final_state = sess.run([model.actions_logits, model.final_state],
                                                  feed_dict = {model.states: state.reshape([1, -1]), 
                                                               model.initial_state: initial_state})
            action = np.argmax(action_logits)
            #state, reward, done, _ = env.step(action)
            env_info = env.step(action)[brain_name]        # send the action to the environment
            next_state = env_info.vector_observations[0]   # get the next state
            reward = env_info.rewards[0]                   # get the reward
            done = env_info.local_done[0]                  # see if episode has finished
            memory.buffer.append([state, action, next_state, reward, float(done)])
            memory.states.append(initial_state)
            total_reward += reward
            initial_state = final_state
            state = next_state
            
            # Training
            #batch, rnn_states = memory.sample(batch_size)
            batch = memory.buffer
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            initial_states = memory.states
            next_actions_logits = sess.run(model.actions_logits,
                                           feed_dict = {model.states: next_states, 
                                                        model.initial_state: initial_states[1]})
            nextQs = np.max(next_actions_logits, axis=1) * (1-dones)
            targetQs = rewards + (gamma * nextQs)
            loss, _ = sess.run([model.loss, model.opt], feed_dict = {model.states: states, 
                                                                     model.actions: actions,
                                                                     model.targetQs: targetQs,
                                                                     model.initial_state: initial_states[0]})
            loss_batch.append(loss)
            if done is True:
                break
                
        episode_reward.append(total_reward)
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'R:{}'.format(total_reward),
              'Steps:{}'.format(num_steps),
              'loss:{:.4f}'.format(np.mean(loss_batch)))
        # Ploting out
        episode_rewards_list.append([ep, np.mean(episode_reward)])
        rewards_list.append([ep, total_reward])
        loss_list.append([ep, np.mean(loss_batch)])
        # Break episode/epoch loop
        if np.mean(episode_reward) >= +13:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model.ckpt')

Episode:0 meanR:0.0000 R:0.0 Steps:299 loss:0.0292
Episode:1 meanR:-1.0000 R:-2.0 Steps:299 loss:0.0176
Episode:2 meanR:-1.0000 R:-1.0 Steps:299 loss:0.0139
Episode:3 meanR:-0.7500 R:0.0 Steps:299 loss:0.0239
Episode:4 meanR:-0.8000 R:-1.0 Steps:299 loss:0.0170
Episode:5 meanR:-0.8333 R:-1.0 Steps:299 loss:0.0079
Episode:6 meanR:-0.4286 R:2.0 Steps:299 loss:0.0143
Episode:7 meanR:-0.3750 R:0.0 Steps:299 loss:0.0275
Episode:8 meanR:0.1111 R:4.0 Steps:299 loss:0.0398
Episode:9 meanR:0.2000 R:1.0 Steps:299 loss:0.0271
Episode:10 meanR:0.4545 R:3.0 Steps:299 loss:0.0275
Episode:11 meanR:0.4167 R:0.0 Steps:299 loss:0.0266
Episode:12 meanR:0.2308 R:-2.0 Steps:299 loss:0.0287
Episode:13 meanR:0.2857 R:1.0 Steps:299 loss:0.0135
Episode:14 meanR:0.2667 R:0.0 Steps:299 loss:0.0075
Episode:15 meanR:0.1250 R:-2.0 Steps:299 loss:0.0150
Episode:16 meanR:0.1765 R:1.0 Steps:299 loss:0.0088
Episode:17 meanR:0.1667 R:0.0 Steps:299 loss:0.0057
Episode:18 meanR:0.1053 R:-1.0 Steps:299 loss:0.0113
Episode:

Episode:156 meanR:7.3200 R:7.0 Steps:299 loss:0.0578
Episode:157 meanR:7.3300 R:15.0 Steps:299 loss:0.0550
Episode:158 meanR:7.2700 R:7.0 Steps:299 loss:0.0957
Episode:159 meanR:7.2400 R:10.0 Steps:299 loss:0.0450
Episode:160 meanR:7.2400 R:4.0 Steps:299 loss:0.0851
Episode:161 meanR:7.2200 R:8.0 Steps:299 loss:0.0679
Episode:162 meanR:7.1600 R:7.0 Steps:299 loss:0.0450
Episode:163 meanR:7.2100 R:13.0 Steps:299 loss:0.0608
Episode:164 meanR:7.2600 R:15.0 Steps:299 loss:0.1260
Episode:165 meanR:7.2500 R:8.0 Steps:299 loss:0.0868
Episode:166 meanR:7.2000 R:2.0 Steps:299 loss:0.0674
Episode:167 meanR:7.2500 R:10.0 Steps:299 loss:0.0705
Episode:168 meanR:7.3300 R:10.0 Steps:299 loss:0.0600
Episode:169 meanR:7.3300 R:5.0 Steps:299 loss:0.0620
Episode:170 meanR:7.3500 R:5.0 Steps:299 loss:0.0500
Episode:171 meanR:7.4200 R:14.0 Steps:299 loss:0.0360
Episode:172 meanR:7.4800 R:9.0 Steps:299 loss:0.0616
Episode:173 meanR:7.5100 R:11.0 Steps:299 loss:0.0628
Episode:174 meanR:7.4700 R:6.0 Steps:2

Episode:310 meanR:8.4500 R:14.0 Steps:299 loss:0.0552
Episode:311 meanR:8.5000 R:7.0 Steps:299 loss:0.0681
Episode:312 meanR:8.4900 R:9.0 Steps:299 loss:0.0588
Episode:313 meanR:8.5100 R:6.0 Steps:299 loss:0.0567
Episode:314 meanR:8.5700 R:9.0 Steps:299 loss:0.0513
Episode:315 meanR:8.5000 R:0.0 Steps:299 loss:0.0443
Episode:316 meanR:8.4000 R:-2.0 Steps:299 loss:0.0352
Episode:317 meanR:8.4600 R:18.0 Steps:299 loss:0.0487
Episode:318 meanR:8.4700 R:11.0 Steps:299 loss:0.0930
Episode:319 meanR:8.6100 R:16.0 Steps:299 loss:0.0689
Episode:320 meanR:8.7000 R:9.0 Steps:299 loss:0.0805
Episode:321 meanR:8.7200 R:8.0 Steps:299 loss:0.0558
Episode:322 meanR:8.5800 R:4.0 Steps:299 loss:0.0445
Episode:323 meanR:8.5300 R:3.0 Steps:299 loss:0.0461
Episode:324 meanR:8.5900 R:16.0 Steps:299 loss:0.0537
Episode:325 meanR:8.4900 R:4.0 Steps:299 loss:0.0577
Episode:326 meanR:8.5100 R:10.0 Steps:299 loss:0.0674
Episode:327 meanR:8.4300 R:9.0 Steps:299 loss:0.0497
Episode:328 meanR:8.3400 R:4.0 Steps:29

Episode:464 meanR:8.8400 R:15.0 Steps:299 loss:0.0570
Episode:465 meanR:8.8800 R:6.0 Steps:299 loss:0.0549
Episode:466 meanR:8.9000 R:3.0 Steps:299 loss:0.0423
Episode:467 meanR:8.9500 R:12.0 Steps:299 loss:0.0583
Episode:468 meanR:8.9600 R:16.0 Steps:299 loss:0.0544
Episode:469 meanR:8.8700 R:4.0 Steps:299 loss:0.0520
Episode:470 meanR:8.8400 R:1.0 Steps:299 loss:0.0533
Episode:471 meanR:8.7200 R:7.0 Steps:299 loss:0.0506
Episode:472 meanR:8.6600 R:6.0 Steps:299 loss:0.0404
Episode:473 meanR:8.6200 R:7.0 Steps:299 loss:0.0449
Episode:474 meanR:8.6600 R:16.0 Steps:299 loss:0.0750
Episode:475 meanR:8.7300 R:17.0 Steps:299 loss:0.0799
Episode:476 meanR:8.6400 R:6.0 Steps:299 loss:0.0718
Episode:477 meanR:8.7400 R:16.0 Steps:299 loss:0.0698
Episode:478 meanR:8.7000 R:8.0 Steps:299 loss:0.0871
Episode:479 meanR:8.7900 R:11.0 Steps:299 loss:0.0740
Episode:480 meanR:8.8100 R:13.0 Steps:299 loss:0.0741
Episode:481 meanR:8.7900 R:8.0 Steps:299 loss:0.0794
Episode:482 meanR:8.8600 R:7.0 Steps:2

Episode:617 meanR:9.9600 R:8.0 Steps:299 loss:0.0793
Episode:618 meanR:10.0600 R:18.0 Steps:299 loss:0.0603
Episode:619 meanR:10.0900 R:11.0 Steps:299 loss:0.0903
Episode:620 meanR:10.1100 R:10.0 Steps:299 loss:0.0497
Episode:621 meanR:10.0700 R:4.0 Steps:299 loss:0.0555
Episode:622 meanR:10.0000 R:8.0 Steps:299 loss:0.0318
Episode:623 meanR:10.0400 R:13.0 Steps:299 loss:0.0446
Episode:624 meanR:10.0600 R:12.0 Steps:299 loss:0.0626
Episode:625 meanR:10.0400 R:10.0 Steps:299 loss:0.0517
Episode:626 meanR:9.9800 R:11.0 Steps:299 loss:0.0481
Episode:627 meanR:9.9500 R:8.0 Steps:299 loss:0.0569
Episode:628 meanR:10.0400 R:16.0 Steps:299 loss:0.0557
Episode:629 meanR:10.0400 R:9.0 Steps:299 loss:0.0788
Episode:630 meanR:10.0400 R:10.0 Steps:299 loss:0.0377
Episode:631 meanR:9.9900 R:6.0 Steps:299 loss:0.0498
Episode:632 meanR:9.8600 R:2.0 Steps:299 loss:0.0264
Episode:633 meanR:9.8900 R:13.0 Steps:299 loss:0.0529
Episode:634 meanR:9.8200 R:8.0 Steps:299 loss:0.0572
Episode:635 meanR:9.8200 

Episode:770 meanR:8.7700 R:8.0 Steps:299 loss:0.0645
Episode:771 meanR:8.7300 R:13.0 Steps:299 loss:0.0671
Episode:772 meanR:8.6800 R:9.0 Steps:299 loss:0.0806
Episode:773 meanR:8.5700 R:3.0 Steps:299 loss:0.0367
Episode:774 meanR:8.5500 R:10.0 Steps:299 loss:0.0496
Episode:775 meanR:8.4500 R:6.0 Steps:299 loss:0.0632
Episode:776 meanR:8.4800 R:13.0 Steps:299 loss:0.0878
Episode:777 meanR:8.3900 R:1.0 Steps:299 loss:0.0714
Episode:778 meanR:8.3000 R:4.0 Steps:299 loss:0.0409
Episode:779 meanR:8.2300 R:8.0 Steps:299 loss:0.0480
Episode:780 meanR:8.2700 R:12.0 Steps:299 loss:0.0717
Episode:781 meanR:8.2800 R:9.0 Steps:299 loss:0.0800
Episode:782 meanR:8.2200 R:9.0 Steps:299 loss:0.0544
Episode:783 meanR:8.1600 R:10.0 Steps:299 loss:0.0590
Episode:784 meanR:8.0700 R:4.0 Steps:299 loss:0.0562
Episode:785 meanR:8.1000 R:9.0 Steps:299 loss:0.0533
Episode:786 meanR:8.2200 R:15.0 Steps:299 loss:0.0861
Episode:787 meanR:8.3000 R:10.0 Steps:299 loss:0.0536
Episode:788 meanR:8.4300 R:13.0 Steps:2

Episode:923 meanR:9.2300 R:7.0 Steps:299 loss:0.0690
Episode:924 meanR:9.2100 R:1.0 Steps:299 loss:0.0640
Episode:925 meanR:9.2300 R:10.0 Steps:299 loss:0.0378
Episode:926 meanR:9.2500 R:10.0 Steps:299 loss:0.0491
Episode:927 meanR:9.1200 R:3.0 Steps:299 loss:0.0426
Episode:928 meanR:9.1400 R:9.0 Steps:299 loss:0.0371
Episode:929 meanR:9.2000 R:19.0 Steps:299 loss:0.0579
Episode:930 meanR:9.1500 R:8.0 Steps:299 loss:0.0675
Episode:931 meanR:9.0700 R:6.0 Steps:299 loss:0.0659
Episode:932 meanR:9.1800 R:12.0 Steps:299 loss:0.0398
Episode:933 meanR:9.0800 R:0.0 Steps:299 loss:0.0355
Episode:934 meanR:9.0700 R:11.0 Steps:299 loss:0.0512
Episode:935 meanR:9.0300 R:9.0 Steps:299 loss:0.0713
Episode:936 meanR:9.0900 R:14.0 Steps:299 loss:0.0650
Episode:937 meanR:9.0500 R:12.0 Steps:299 loss:0.0657
Episode:938 meanR:9.0500 R:8.0 Steps:299 loss:0.0673
Episode:939 meanR:9.0800 R:6.0 Steps:299 loss:0.0618
Episode:940 meanR:9.0800 R:16.0 Steps:299 loss:0.0562
Episode:941 meanR:9.0900 R:8.0 Steps:2

Episode:1075 meanR:10.3000 R:0.0 Steps:299 loss:0.0391
Episode:1076 meanR:10.3100 R:10.0 Steps:299 loss:0.0552
Episode:1077 meanR:10.3200 R:15.0 Steps:299 loss:0.0805
Episode:1078 meanR:10.3400 R:9.0 Steps:299 loss:0.0829
Episode:1079 meanR:10.4100 R:13.0 Steps:299 loss:0.0536
Episode:1080 meanR:10.5100 R:16.0 Steps:299 loss:0.0922
Episode:1081 meanR:10.5000 R:17.0 Steps:299 loss:0.0718
Episode:1082 meanR:10.5200 R:11.0 Steps:299 loss:0.0773
Episode:1083 meanR:10.4200 R:6.0 Steps:299 loss:0.0578
Episode:1084 meanR:10.3600 R:7.0 Steps:299 loss:0.0508
Episode:1085 meanR:10.2800 R:8.0 Steps:299 loss:0.0403
Episode:1086 meanR:10.3300 R:9.0 Steps:299 loss:0.0608
Episode:1087 meanR:10.3600 R:12.0 Steps:299 loss:0.0618
Episode:1088 meanR:10.3500 R:8.0 Steps:299 loss:0.0540
Episode:1089 meanR:10.3900 R:11.0 Steps:299 loss:0.0535
Episode:1090 meanR:10.3900 R:6.0 Steps:299 loss:0.0520
Episode:1091 meanR:10.2900 R:9.0 Steps:299 loss:0.0500
Episode:1092 meanR:10.2500 R:5.0 Steps:299 loss:0.0432
Ep

Episode:1225 meanR:9.8800 R:11.0 Steps:299 loss:0.0526
Episode:1226 meanR:9.7800 R:4.0 Steps:299 loss:0.0681
Episode:1227 meanR:9.7600 R:11.0 Steps:299 loss:0.0575
Episode:1228 meanR:9.7300 R:7.0 Steps:299 loss:0.0462
Episode:1229 meanR:9.7100 R:8.0 Steps:299 loss:0.0489
Episode:1230 meanR:9.8100 R:9.0 Steps:299 loss:0.0574
Episode:1231 meanR:9.9100 R:18.0 Steps:299 loss:0.0708
Episode:1232 meanR:9.8500 R:9.0 Steps:299 loss:0.0692
Episode:1233 meanR:9.8400 R:9.0 Steps:299 loss:0.0635
Episode:1234 meanR:9.8500 R:8.0 Steps:299 loss:0.0662
Episode:1235 meanR:9.7800 R:10.0 Steps:299 loss:0.0490
Episode:1236 meanR:9.8400 R:12.0 Steps:299 loss:0.0564
Episode:1237 meanR:9.8200 R:3.0 Steps:299 loss:0.0782
Episode:1238 meanR:9.8200 R:12.0 Steps:299 loss:0.0361
Episode:1239 meanR:9.8600 R:11.0 Steps:299 loss:0.0576
Episode:1240 meanR:10.0100 R:15.0 Steps:299 loss:0.0646
Episode:1241 meanR:10.1200 R:15.0 Steps:299 loss:0.0578
Episode:1242 meanR:10.1000 R:10.0 Steps:299 loss:0.0685
Episode:1243 me

Episode:1373 meanR:10.9700 R:13.0 Steps:299 loss:0.0586
Episode:1374 meanR:11.0000 R:10.0 Steps:299 loss:0.0765
Episode:1375 meanR:11.0600 R:18.0 Steps:299 loss:0.0708
Episode:1376 meanR:10.9700 R:3.0 Steps:299 loss:0.0798
Episode:1377 meanR:10.9400 R:8.0 Steps:299 loss:0.0412
Episode:1378 meanR:10.9900 R:14.0 Steps:299 loss:0.0606
Episode:1379 meanR:11.0100 R:11.0 Steps:299 loss:0.0621
Episode:1380 meanR:11.0200 R:13.0 Steps:299 loss:0.0546
Episode:1381 meanR:10.9800 R:13.0 Steps:299 loss:0.0853
Episode:1382 meanR:10.9800 R:14.0 Steps:299 loss:0.0644
Episode:1383 meanR:10.9400 R:10.0 Steps:299 loss:0.0616
Episode:1384 meanR:11.0000 R:12.0 Steps:299 loss:0.0532
Episode:1385 meanR:10.9900 R:14.0 Steps:299 loss:0.0482
Episode:1386 meanR:11.0400 R:18.0 Steps:299 loss:0.0540
Episode:1387 meanR:10.9800 R:9.0 Steps:299 loss:0.0699
Episode:1388 meanR:11.0000 R:9.0 Steps:299 loss:0.0549
Episode:1389 meanR:10.9100 R:9.0 Steps:299 loss:0.0667
Episode:1390 meanR:10.9700 R:13.0 Steps:299 loss:0.07

Episode:1522 meanR:9.7400 R:9.0 Steps:299 loss:0.0618
Episode:1523 meanR:9.7800 R:13.0 Steps:299 loss:0.0694
Episode:1524 meanR:9.7900 R:15.0 Steps:299 loss:0.0585
Episode:1525 meanR:9.8600 R:19.0 Steps:299 loss:0.0961
Episode:1526 meanR:9.8500 R:15.0 Steps:299 loss:0.1210
Episode:1527 meanR:9.9400 R:14.0 Steps:299 loss:0.0909
Episode:1528 meanR:9.9400 R:13.0 Steps:299 loss:0.0629
Episode:1529 meanR:10.0100 R:16.0 Steps:299 loss:0.0460
Episode:1530 meanR:10.0700 R:6.0 Steps:299 loss:0.0899
Episode:1531 meanR:10.1300 R:11.0 Steps:299 loss:0.0504
Episode:1532 meanR:10.0600 R:5.0 Steps:299 loss:0.0549
Episode:1533 meanR:10.0800 R:13.0 Steps:299 loss:0.0693
Episode:1534 meanR:10.0600 R:9.0 Steps:299 loss:0.0868
Episode:1535 meanR:10.0200 R:6.0 Steps:299 loss:0.0686
Episode:1536 meanR:10.0000 R:11.0 Steps:299 loss:0.0445
Episode:1537 meanR:10.0600 R:13.0 Steps:299 loss:0.0542
Episode:1538 meanR:10.1400 R:18.0 Steps:299 loss:0.0705
Episode:1539 meanR:10.0500 R:6.0 Steps:299 loss:0.0421
Episo

Episode:1671 meanR:10.1900 R:6.0 Steps:299 loss:0.0499
Episode:1672 meanR:10.0800 R:2.0 Steps:299 loss:0.0409
Episode:1673 meanR:10.0800 R:13.0 Steps:299 loss:0.0528
Episode:1674 meanR:10.1400 R:14.0 Steps:299 loss:0.0569
Episode:1675 meanR:10.1600 R:12.0 Steps:299 loss:0.0664
Episode:1676 meanR:10.0600 R:4.0 Steps:299 loss:0.0960
Episode:1677 meanR:9.9700 R:5.0 Steps:299 loss:0.0454
Episode:1678 meanR:9.9600 R:13.0 Steps:299 loss:0.0388
Episode:1679 meanR:9.9300 R:12.0 Steps:299 loss:0.0396
Episode:1680 meanR:9.8500 R:8.0 Steps:299 loss:0.0774
Episode:1681 meanR:9.8900 R:8.0 Steps:299 loss:0.0432
Episode:1682 meanR:9.9000 R:11.0 Steps:299 loss:0.0402
Episode:1683 meanR:10.0200 R:16.0 Steps:299 loss:0.0811
Episode:1684 meanR:10.0100 R:12.0 Steps:299 loss:0.0486
Episode:1685 meanR:10.1100 R:16.0 Steps:299 loss:0.0620
Episode:1686 meanR:10.0400 R:2.0 Steps:299 loss:0.0630
Episode:1687 meanR:10.0400 R:13.0 Steps:299 loss:0.0564
Episode:1688 meanR:9.9200 R:4.0 Steps:299 loss:0.0654
Episode

Episode:1821 meanR:10.1500 R:9.0 Steps:299 loss:0.0682
Episode:1822 meanR:10.0800 R:10.0 Steps:299 loss:0.0599
Episode:1823 meanR:10.1200 R:9.0 Steps:299 loss:0.0759
Episode:1824 meanR:10.1300 R:8.0 Steps:299 loss:0.0500
Episode:1825 meanR:10.0900 R:12.0 Steps:299 loss:0.0529
Episode:1826 meanR:10.1300 R:6.0 Steps:299 loss:0.0582
Episode:1827 meanR:10.0800 R:4.0 Steps:299 loss:0.0593
Episode:1828 meanR:10.1400 R:15.0 Steps:299 loss:0.0418
Episode:1829 meanR:10.0800 R:5.0 Steps:299 loss:0.0382
Episode:1830 meanR:10.2100 R:15.0 Steps:299 loss:0.0562
Episode:1831 meanR:10.2100 R:7.0 Steps:299 loss:0.0667
Episode:1832 meanR:10.2600 R:13.0 Steps:299 loss:0.0429
Episode:1833 meanR:10.1600 R:5.0 Steps:299 loss:0.0706
Episode:1834 meanR:10.1200 R:4.0 Steps:299 loss:0.0303
Episode:1835 meanR:10.0700 R:9.0 Steps:299 loss:0.0418
Episode:1836 meanR:10.0600 R:14.0 Steps:299 loss:0.0566
Episode:1837 meanR:10.0600 R:11.0 Steps:299 loss:0.0694
Episode:1838 meanR:10.0700 R:9.0 Steps:299 loss:0.0484
Epi

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N 

In [None]:
eps, arr = np.array(episode_rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Episode rewards')

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Average losses')

In [30]:
# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Testing episodes/epochs
    for _ in range(1):
        total_reward = 0
        #state = env.reset()
        env_info = env.reset(train_mode=False)[brain_name] # reset the environment
        state = env_info.vector_observations[0]   # get the current state

        # Testing steps/batches
        while True:
            action_logits = sess.run(model.actions_logits, feed_dict={model.states: state.reshape([1, -1])})
            action = np.argmax(action_logits)
            #state, reward, done, _ = env.step(action)
            env_info = env.step(action)[brain_name]        # send the action to the environment
            state = env_info.vector_observations[0]   # get the next state
            reward = env_info.rewards[0]                   # get the reward
            done = env_info.local_done[0]                  # see if episode has finished
            total_reward += reward
            if done:
                break
                
        print('total_reward: {:.2f}'.format(total_reward))

INFO:tensorflow:Restoring parameters from checkpoints/model.ckpt


total_reward: 2.00


In [None]:
# Be careful!!!!!!!!!!!!!!!!
# Closing the env
env.close()