# Navigation

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the first project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893).

### 1. Start the Environment

We begin by importing some necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Banana.app"`
- **Windows** (x86): `"path/to/Banana_Windows_x86/Banana.exe"`
- **Windows** (x86_64): `"path/to/Banana_Windows_x86_64/Banana.exe"`
- **Linux** (x86): `"path/to/Banana_Linux/Banana.x86"`
- **Linux** (x86_64): `"path/to/Banana_Linux/Banana.x86_64"`
- **Linux** (x86, headless): `"path/to/Banana_Linux_NoVis/Banana.x86"`
- **Linux** (x86_64, headless): `"path/to/Banana_Linux_NoVis/Banana.x86_64"`

For instance, if you are using a Mac, then you downloaded `Banana.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Banana.app")
```
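
If the notebook is shared across machines, the choice of `file_name` can also be made programmatically. The sketch below is only an illustration: the helper name `banana_build_path` and its `base_dir` and `headless` parameters are hypothetical, and it assumes the builds are unpacked under the folder names listed above and that the machine is an x86 or x86_64 system.

```python
import os
import platform

def banana_build_path(base_dir=".", headless=False):
    """Return a likely path to the Banana build for this OS (hypothetical helper)."""
    system = platform.system()
    # Assumption: a machine string ending in "64" means the 64-bit build applies.
    is_64bit = platform.machine().endswith("64")
    if system == "Darwin":
        return os.path.join(base_dir, "Banana.app")
    if system == "Windows":
        folder = "Banana_Windows_x86_64" if is_64bit else "Banana_Windows_x86"
        return os.path.join(base_dir, folder, "Banana.exe")
    if system == "Linux":
        folder = "Banana_Linux_NoVis" if headless else "Banana_Linux"
        binary = "Banana.x86_64" if is_64bit else "Banana.x86"
        return os.path.join(base_dir, folder, binary)
    raise RuntimeError("Unsupported platform: {}".format(system))
```

The returned string could then be passed as `file_name` when constructing `UnityEnvironment`.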

In [2]:
# env = UnityEnvironment(file_name="/home/arasdar/VisualBanana_Linux/Banana.x86")
env = UnityEnvironment(file_name="/home/arasdar/Banana_Linux/Banana.x86_64")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: BananaBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 37
        Number of stacked Vector Observation: 1
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_**, which are responsible for deciding the actions of their associated agents. Here we check for the first brain available and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

The simulation contains a single agent that navigates a large environment.  At each time step, it has four actions at its disposal:
- `0` - walk forward 
- `1` - walk backward
- `2` - turn left
- `3` - turn right

The state space has `37` dimensions and contains the agent's velocity, along with ray-based perception of objects around the agent's forward direction.  A reward of `+1` is provided for collecting a yellow banana, and a reward of `-1` is provided for collecting a blue banana.

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents in the environment
print('Number of agents:', len(env_info.agents))

# number of actions
action_size = brain.vector_action_space_size
print('Number of actions:', action_size)

# examine the state space 
state = env_info.vector_observations[0]
# print('States look like:', state)
state_size = len(state)
print('States have length:', state_size)
# print(state.shape, len(env_info.vector_observations), env_info.vector_observations.shape)

Number of agents: 1
Number of actions: 4
States have length: 37


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, you can watch how the agent performs when it selects an action uniformly at random at each time step.  A window should pop up that allows you to observe the agent as it moves through the environment.

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!
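
One common way to move from purely random actions toward learned ones is epsilon-greedy selection. The sketch below is an assumption rather than part of the project code: `q_values` stands in for whatever per-action value estimates your agent produces, and `epsilon_greedy` is a hypothetical helper.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=np.random):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        # Explore: any of the available actions, chosen uniformly.
        return int(rng.randint(len(q_values)))
    # Exploit: the action with the highest estimated value.
    return int(np.argmax(q_values))

# Made-up value estimates for the four actions; epsilon=0.0 is always greedy.
action = epsilon_greedy(np.array([0.1, 0.5, 0.2, 0.0]), epsilon=0.0)
print(action)  # → 1
```

Starting with a large `epsilon` and shrinking it over episodes is a typical choice, since it lets the agent rely more on its estimates as they improve.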

In [5]:
env_info = env.reset(train_mode=False)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
while True:
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    state = next_state                             # roll over the state to next time step
    if done:                                       # exit loop if episode finished
        print(state.shape)
        break
    
print("Score: {}".format(score))

(37,)
Score: 0.0


When finished, you can close the environment.

In [6]:
# env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [7]:
env_info = env.reset(train_mode=True)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
batch = []
while True:                                        # loop until the episode ends
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    #print(state, action, reward, done)
    batch.append([action, state, reward, done])
    state = next_state                             # roll over the state to next time step
    if done:                                       # exit loop if episode finished
        break
    
# print("Score: {}".format(score))

In [8]:
batch[0], batch[0][1].shape

([2, array([0.        , 0.        , 1.        , 0.        , 0.16101955,
         1.        , 0.        , 0.        , 0.        , 0.04571758,
         1.        , 0.        , 0.        , 0.        , 0.2937662 ,
         0.        , 0.        , 1.        , 0.        , 0.14386636,
         0.        , 0.        , 1.        , 0.        , 0.16776823,
         1.        , 0.        , 0.        , 0.        , 0.04420976,
         1.        , 0.        , 0.        , 0.        , 0.05423063,
         0.        , 0.        ]), 0.0, False], (37,))

In [9]:
batch[0][1].shape

(37,)

In [10]:
batch[0]

[2, array([0.        , 0.        , 1.        , 0.        , 0.16101955,
        1.        , 0.        , 0.        , 0.        , 0.04571758,
        1.        , 0.        , 0.        , 0.        , 0.2937662 ,
        0.        , 0.        , 1.        , 0.        , 0.14386636,
        0.        , 0.        , 1.        , 0.        , 0.16776823,
        1.        , 0.        , 0.        , 0.        , 0.04420976,
        1.        , 0.        , 0.        , 0.        , 0.05423063,
        0.        , 0.        ]), 0.0, False]

In [11]:
actions = np.array([each[0] for each in batch])
states = np.array([each[1] for each in batch])
rewards = np.array([each[2] for each in batch])
dones = np.array([each[3] for each in batch])
# infos = np.array([each[4] for each in batch])
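
The four list comprehensions above can also be written as a single transpose with `zip`. This sketch uses a tiny made-up `demo_batch` in place of the real `batch`, so the values are purely illustrative and the `demo_` names are hypothetical.

```python
import numpy as np

# Made-up transitions in the same [action, state, reward, done] layout as `batch`.
demo_batch = [
    [2, np.zeros(37), 0.0, False],
    [0, np.ones(37), 1.0, True],
]

# zip(*...) transposes the list of transitions into four per-field tuples,
# and each tuple is then stacked into its own array.
demo_actions, demo_states, demo_rewards, demo_dones = (
    np.array(field) for field in zip(*demo_batch)
)

print(demo_states.shape, demo_actions.shape, demo_rewards.shape, demo_dones.shape)
# → (2, 37) (2,) (2,) (2,)
```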

In [12]:
# print(rewards[:])
print(np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
print(np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
print(np.max(np.array(actions)), np.min(np.array(actions)), 
      (np.max(np.array(actions)) - np.min(np.array(actions)))+1)
print(np.max(np.array(rewards)), np.min(np.array(rewards)))
print(np.max(np.array(states)), np.min(np.array(states)))

(300,) (300, 37) (300,) (300,)
float64 float64 int64 bool
3 0 4
1.0 0.0
10.806394577026367 -10.978788375854492


In [13]:
print('state size:{}'.format(states.shape), 
      'actions:{}'.format(actions.shape)) 
print('action size:{}'.format(np.max(actions) - np.min(actions)+1))

state size:(300, 37) actions:(300,)
action size:4


In [14]:
import gym
import random
import torch
import numpy as np
from collections import deque
import matplotlib.pyplot as plt
%matplotlib inline

from ddpg_agent import Agent

In [15]:
# env = gym.make('Pendulum-v0')
# env.seed(2)
agent = Agent(state_size=37, action_size=4, random_seed=2)

In [16]:
env_info = env.reset(train_mode=True)[brain_name] # reset the environment
state = env_info.vector_observations[0]   # get the initial state
while True:
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    state = next_state
    if done:                                       # exit loop if episode finished
        break

In [None]:
total_reward_list = []
total_reward_deque = deque(maxlen=100)
for ep in range(11111):
    #state = env.reset()
    env_info = env.reset(train_mode=True)[brain_name] # reset the environment
    state = env_info.vector_observations[0]   # get the initial state
    agent.reset()
    total_reward = 0
    while True:
        action_logits = agent.act(state)
        action = np.argmax(action_logits) # discrete action space
        #print(action_logits, action)
        #next_state, reward, done, _ = env.step(action)
        env_info = env.step(action)[brain_name]        # send the action to the environment
        next_state = env_info.vector_observations[0]   # get the next state
        reward = env_info.rewards[0]                   # get the reward
        done = env_info.local_done[0]                  # see if episode has finished
        agent.step(state, action_logits, reward, next_state, done)
        state = next_state
        total_reward += reward
        if done:
            break 
    total_reward_deque.append(total_reward)
    total_reward_list.append([ep, np.mean(total_reward_deque)])
    print('\rEpisode {}\tAverage Score: {:.2f}'.format(ep, np.mean(total_reward_deque)))   
    if np.mean(total_reward_deque) >= +13:
        torch.save(agent.actor_local.state_dict(), 'nav-checkpoint_actor.pth')
        torch.save(agent.critic_local.state_dict(), 'nav-checkpoint_critic.pth')
        break



Episode 0	Average Score: 1.00
Episode 1	Average Score: 0.50
Episode 2	Average Score: 0.00
Episode 3	Average Score: -0.75
Episode 4	Average Score: -1.00
Episode 5	Average Score: -0.67
Episode 6	Average Score: -0.57
Episode 7	Average Score: -0.62
Episode 8	Average Score: 0.00
Episode 9	Average Score: -0.10
Episode 10	Average Score: -0.18
Episode 11	Average Score: -0.17
Episode 12	Average Score: -0.08
Episode 13	Average Score: -0.14
Episode 14	Average Score: -0.20
Episode 15	Average Score: -0.25
Episode 16	Average Score: -0.29
Episode 17	Average Score: -0.22
Episode 18	Average Score: -0.21
Episode 19	Average Score: -0.25
Episode 20	Average Score: -0.19
Episode 21	Average Score: 0.05
Episode 22	Average Score: 0.00
Episode 23	Average Score: 0.00
Episode 24	Average Score: 0.08
Episode 25	Average Score: 0.38
Episode 26	Average Score: 0.59
Episode 27	Average Score: 0.61
Episode 28	Average Score: 0.97
Episode 29	Average Score: 1.07
Episode 30	Average Score: 1.19
Episode 31	Average Score: 1.53
...

Episode 259	Average Score: 0.74
Episode 260	Average Score: 0.73
Episode 261	Average Score: 0.74
Episode 262	Average Score: 0.74
Episode 263	Average Score: 0.73
Episode 264	Average Score: 0.76
Episode 265	Average Score: 0.75
Episode 266	Average Score: 0.72
Episode 267	Average Score: 0.72
Episode 268	Average Score: 0.73
Episode 269	Average Score: 0.75
Episode 270	Average Score: 0.75
Episode 271	Average Score: 0.75
Episode 272	Average Score: 0.79
Episode 273	Average Score: 0.79
Episode 274	Average Score: 0.84
Episode 275	Average Score: 0.84
Episode 276	Average Score: 0.86
Episode 277	Average Score: 0.84
Episode 278	Average Score: 0.84
Episode 279	Average Score: 0.78
Episode 280	Average Score: 0.80
Episode 281	Average Score: 0.81
Episode 282	Average Score: 0.80
Episode 283	Average Score: 0.81
Episode 284	Average Score: 0.80
Episode 285	Average Score: 0.78
Episode 286	Average Score: 0.77
Episode 287	Average Score: 0.81
Episode 288	Average Score: 0.83
Episode 289	Average Score: 0.84
...

Episode 516	Average Score: 0.11
Episode 517	Average Score: 0.10
Episode 518	Average Score: 0.10
Episode 519	Average Score: 0.11
Episode 520	Average Score: 0.11
Episode 521	Average Score: 0.10
Episode 522	Average Score: 0.11
Episode 523	Average Score: 0.11
Episode 524	Average Score: 0.10
Episode 525	Average Score: 0.10
Episode 526	Average Score: 0.08
Episode 527	Average Score: 0.06
Episode 528	Average Score: 0.06
Episode 529	Average Score: 0.06
Episode 530	Average Score: 0.04
Episode 531	Average Score: 0.05
Episode 532	Average Score: 0.07
Episode 533	Average Score: 0.07
Episode 534	Average Score: 0.07
Episode 535	Average Score: 0.07
Episode 536	Average Score: 0.08
Episode 537	Average Score: 0.08
Episode 538	Average Score: 0.07
Episode 539	Average Score: 0.07
Episode 540	Average Score: 0.07
Episode 541	Average Score: 0.07
Episode 542	Average Score: 0.07
Episode 543	Average Score: 0.07
Episode 544	Average Score: 0.07
Episode 545	Average Score: 0.06
Episode 546	Average Score: 0.07
...

Episode 773	Average Score: 0.13
Episode 774	Average Score: 0.13
Episode 775	Average Score: 0.13
Episode 776	Average Score: 0.14
Episode 777	Average Score: 0.16
Episode 778	Average Score: 0.16
Episode 779	Average Score: 0.19
Episode 780	Average Score: 0.17
Episode 781	Average Score: 0.16
Episode 782	Average Score: 0.16
Episode 783	Average Score: 0.15
Episode 784	Average Score: 0.15
Episode 785	Average Score: 0.15
Episode 786	Average Score: 0.17
Episode 787	Average Score: 0.16
Episode 788	Average Score: 0.16
Episode 789	Average Score: 0.16
Episode 790	Average Score: 0.17
Episode 791	Average Score: 0.16
Episode 792	Average Score: 0.16
Episode 793	Average Score: 0.16
Episode 794	Average Score: 0.15
Episode 795	Average Score: 0.15
Episode 796	Average Score: 0.17
Episode 797	Average Score: 0.17
Episode 798	Average Score: 0.19
Episode 799	Average Score: 0.20
Episode 800	Average Score: 0.19
Episode 801	Average Score: 0.19
Episode 802	Average Score: 0.19
Episode 803	Average Score: 0.19
...

Episode 1027	Average Score: 0.02
Episode 1028	Average Score: 0.02
Episode 1029	Average Score: 0.03
Episode 1030	Average Score: 0.01
Episode 1031	Average Score: 0.00
Episode 1032	Average Score: -0.01
Episode 1033	Average Score: -0.02
Episode 1034	Average Score: 0.00
Episode 1035	Average Score: 0.01
Episode 1036	Average Score: 0.01
Episode 1037	Average Score: 0.00
Episode 1038	Average Score: -0.01
Episode 1039	Average Score: 0.00
Episode 1040	Average Score: 0.01
Episode 1041	Average Score: 0.02
Episode 1042	Average Score: 0.02
Episode 1043	Average Score: 0.02
Episode 1044	Average Score: 0.02
Episode 1045	Average Score: 0.02
Episode 1046	Average Score: 0.01
Episode 1047	Average Score: 0.01
Episode 1048	Average Score: 0.00
Episode 1049	Average Score: 0.00
Episode 1050	Average Score: 0.00
Episode 1051	Average Score: 0.00
Episode 1052	Average Score: 0.00
Episode 1053	Average Score: 0.01
Episode 1054	Average Score: 0.01
Episode 1055	Average Score: 0.01
Episode 1056	Average Score: 0.01
...

Episode 1275	Average Score: -0.02
Episode 1276	Average Score: -0.02
Episode 1277	Average Score: -0.02
Episode 1278	Average Score: -0.02
Episode 1279	Average Score: -0.03
Episode 1280	Average Score: -0.03
Episode 1281	Average Score: -0.03
Episode 1282	Average Score: -0.02
Episode 1283	Average Score: -0.02
Episode 1284	Average Score: -0.01
Episode 1285	Average Score: 0.02
Episode 1286	Average Score: 0.05
Episode 1287	Average Score: 0.02
Episode 1288	Average Score: 0.04
Episode 1289	Average Score: 0.04
Episode 1290	Average Score: 0.05
Episode 1291	Average Score: 0.05
Episode 1292	Average Score: 0.01
Episode 1293	Average Score: 0.01
Episode 1294	Average Score: 0.01
Episode 1295	Average Score: 0.01
Episode 1296	Average Score: -0.02
Episode 1297	Average Score: -0.02
Episode 1298	Average Score: -0.02
Episode 1299	Average Score: -0.01
Episode 1300	Average Score: 0.00
Episode 1301	Average Score: 0.00
Episode 1302	Average Score: -0.01
Episode 1303	Average Score: 0.01
...

Episode 1520	Average Score: 0.30
Episode 1521	Average Score: 0.33
Episode 1522	Average Score: 0.33
Episode 1523	Average Score: 0.33
Episode 1524	Average Score: 0.32
Episode 1525	Average Score: 0.35
Episode 1526	Average Score: 0.35
Episode 1527	Average Score: 0.35
Episode 1528	Average Score: 0.39
Episode 1529	Average Score: 0.34
Episode 1530	Average Score: 0.36
Episode 1531	Average Score: 0.31
Episode 1532	Average Score: 0.31
Episode 1533	Average Score: 0.32
Episode 1534	Average Score: 0.33
Episode 1535	Average Score: 0.33
Episode 1536	Average Score: 0.35
Episode 1537	Average Score: 0.35
Episode 1538	Average Score: 0.35
Episode 1539	Average Score: 0.35
Episode 1540	Average Score: 0.37
Episode 1541	Average Score: 0.39
Episode 1542	Average Score: 0.42
Episode 1543	Average Score: 0.42
Episode 1544	Average Score: 0.44
Episode 1545	Average Score: 0.46
Episode 1546	Average Score: 0.46
Episode 1547	Average Score: 0.44
Episode 1548	Average Score: 0.43
Episode 1549	Average Score: 0.47
...

Episode 1767	Average Score: -0.43
Episode 1768	Average Score: -0.42
Episode 1769	Average Score: -0.41
Episode 1770	Average Score: -0.35
Episode 1771	Average Score: -0.35
Episode 1772	Average Score: -0.34
Episode 1773	Average Score: -0.33
Episode 1774	Average Score: -0.37
Episode 1775	Average Score: -0.37
Episode 1776	Average Score: -0.36
Episode 1777	Average Score: -0.34
Episode 1778	Average Score: -0.30
Episode 1779	Average Score: -0.27
Episode 1780	Average Score: -0.29
Episode 1781	Average Score: -0.28
Episode 1782	Average Score: -0.28
Episode 1783	Average Score: -0.28
Episode 1784	Average Score: -0.25
Episode 1785	Average Score: -0.24
Episode 1786	Average Score: -0.26
Episode 1787	Average Score: -0.26
Episode 1788	Average Score: -0.25
Episode 1789	Average Score: -0.18
Episode 1790	Average Score: -0.20
Episode 1791	Average Score: -0.18
Episode 1792	Average Score: -0.19
Episode 1793	Average Score: -0.16
Episode 1794	Average Score: -0.18
Episode 1795	Average Score: -0.18
...

Episode 2009	Average Score: 0.01
Episode 2010	Average Score: 0.01
Episode 2011	Average Score: 0.02
Episode 2012	Average Score: 0.02
Episode 2013	Average Score: 0.03
Episode 2014	Average Score: 0.00
Episode 2015	Average Score: -0.02
Episode 2016	Average Score: -0.02
Episode 2017	Average Score: -0.02
Episode 2018	Average Score: -0.02
Episode 2019	Average Score: -0.02
Episode 2020	Average Score: 0.02
Episode 2021	Average Score: 0.03
Episode 2022	Average Score: 0.02
Episode 2023	Average Score: 0.02
Episode 2024	Average Score: 0.02
Episode 2025	Average Score: 0.00
Episode 2026	Average Score: 0.00
Episode 2027	Average Score: 0.00
Episode 2028	Average Score: 0.01
Episode 2029	Average Score: -0.01
Episode 2030	Average Score: 0.01
Episode 2031	Average Score: 0.01
Episode 2032	Average Score: 0.02
Episode 2033	Average Score: 0.03
Episode 2034	Average Score: 0.05
Episode 2035	Average Score: 0.05
Episode 2036	Average Score: 0.06
Episode 2037	Average Score: 0.06
Episode 2038	Average Score: 0.07
...
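
The stopping rule in the training loop compares the mean of the most recent episode scores (at most 100) against 13. A minimal sketch of that rolling average, using made-up scores and a hypothetical `is_solved` helper:

```python
from collections import deque
import numpy as np

def is_solved(scores_window, target=13.0):
    """True once the mean score over the window reaches the target."""
    return len(scores_window) > 0 and float(np.mean(scores_window)) >= target

window = deque(maxlen=100)                   # keeps only the 100 most recent scores
for score in [10.0] * 100 + [16.0] * 100:    # made-up episode scores
    window.append(score)                     # oldest score drops out once full

print(is_solved(window))  # → True
```

As in the loop above, this check can fire before 100 episodes have been played; additionally requiring `len(window) == window.maxlen` would be a stricter variant.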

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N 
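
The cumulative-sum trick in `running_mean` is equivalent to a moving average over windows of length `N`. The sketch below restates the function so that it runs on its own and checks it against `np.convolve` on made-up data.

```python
import numpy as np

def running_mean(x, N):
    # Prepend 0 so that cumsum[i] is the sum of the first i elements.
    cumsum = np.cumsum(np.insert(x, 0, 0))
    # Sums taken N apart differ by exactly one window's total.
    return (cumsum[N:] - cumsum[:-N]) / N

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up values
smoothed = running_mean(data, 3)
reference = np.convolve(data, np.ones(3) / 3, mode="valid")

print(smoothed)                          # → [2. 3. 4.]
print(np.allclose(smoothed, reference))  # → True
```

The result has `len(x) - N + 1` entries, which is why the plotting cells align it with the last `len(smoothed_arr)` episode indices.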

In [None]:
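# total_reward_list holds [episode, 100-episode average score] pairs from the training loop
eps, arr = np.array(total_reward_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Average score (last 100 episodes)')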

In [None]:
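# NOTE: loss_list is not populated by the training loop above; this cell
# assumes losses were logged elsewhere as [episode, loss] pairs.
eps, arr = np.array(loss_list).T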
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Average losses')

In [33]:
# # import gym
# # # env = gym.make('CartPole-v0')
# # env = gym.make('CartPole-v1')
# # # env = gym.make('Acrobot-v1')
# # # env = gym.make('MountainCar-v0')
# # # env = gym.make('Pendulum-v0')
# # # env = gym.make('Blackjack-v0')
# # # env = gym.make('FrozenLake-v0')
# # # env = gym.make('AirRaid-ram-v0')
# # # env = gym.make('AirRaid-v0')
# # # env = gym.make('BipedalWalker-v2')
# # # env = gym.make('Copy-v0')
# # # env = gym.make('CarRacing-v0')
# # # env = gym.make('Ant-v2') #mujoco
# # # env = gym.make('FetchPickAndPlace-v1') # mujoco required!

# with tf.Session() as sess:
#     #sess.run(tf.global_variables_initializer())
#     saver.restore(sess, 'checkpoints/model-nav.ckpt')    
#     #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
#     # Episodes/epochs
#     for _ in range(1):
#         state = env.reset()
#         total_reward = 0

#         # Steps/batches
#         #for _ in range(111111111111111111):
#         while True:
#             env.render()
#             action_logits = sess.run(model.actions_logits, feed_dict={model.states: np.reshape(state, [1, -1])})
#             action = np.argmax(action_logits)
#             state, reward, done, _ = env.step(action)
#             total_reward += reward
#             if done:
#                 break
                
#         # Closing the env
#         print('total_reward: {:.2f}'.format(total_reward))
#         env.close()