# Navigation


### 1. Start the Environment

We begin by importing some necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Banana.app"`
- **Windows** (x86): `"path/to/Banana_Windows_x86/Banana.exe"`
- **Windows** (x86_64): `"path/to/Banana_Windows_x86_64/Banana.exe"`
- **Linux** (x86): `"path/to/Banana_Linux/Banana.x86"`
- **Linux** (x86_64): `"path/to/Banana_Linux/Banana.x86_64"`
- **Linux** (x86, headless): `"path/to/Banana_Linux_NoVis/Banana.x86"`
- **Linux** (x86_64, headless): `"path/to/Banana_Linux_NoVis/Banana.x86_64"`

For instance, if you are using a Mac, then you downloaded `Banana.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Banana.app")
```

In [2]:
env = UnityEnvironment(file_name="Banana.app")


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [None]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

The simulation contains a single agent that navigates a large environment.  At each time step, it has four actions at its disposal:
- `0` - walk forward 
- `1` - walk backward
- `2` - turn left
- `3` - turn right

The state space has `37` dimensions and contains the agent's velocity, along with ray-based perception of objects around the agent's forward direction.  A reward of `+1` is provided for collecting a yellow banana, and a reward of `-1` is provided for collecting a blue banana.

Run the code cell below to print some information about the environment.

In [None]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents in the environment
print('Number of agents:', len(env_info.agents))

# number of actions
action_size = brain.vector_action_space_size
print('Number of actions:', action_size)

# examine the state space 
state = env_info.vector_observations[0]
print('States look like:', state)
state_size = len(state)
print('States have length:', state_size)

### 3. Let's start!



In [None]:
# some needed libraries
import random
from collections import namedtuple, deque

#### a) isolate some constants and the hyperparameters

In [None]:
# constants and hyperparameters
# G, T and LR are the main values worth tuning; the epsilon values are fairly standard for this kind of task, but can be experimented with as well

BUFFER_SIZE = int(1e5)  # replay buffer
BATCH_SIZE = 64         # training batch size
G = 0.99                # gamma - discount factor
T = 2e-3                # tau - soft update factor
LR = 7e-4               # learning rate
UPDATE_FREQUENCY = 4    # how often to update the network
N_EPISODES = 1000       # must solve within this window
EPS_START = 1.          # epsilon-greedy params: start, end, decay
EPS_END = 0.01
EPS_DECAY = 0.995
MAX_STEPS = 1000        # cap on steps per episode: leave the inner loop even if the env has not returned done
WINDOW_LENGTH = 100     # average score over last x episodes
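
To get a feel for the exploration schedule, the sketch below counts how many episodes pass before epsilon is clamped at its floor. It assumes epsilon is multiplied by `EPS_DECAY` once per episode, as in the training loop later in this notebook; the helper name `episodes_until_floor` is illustrative only.

```python
def episodes_until_floor(eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Count the per-episode decays needed before epsilon is clamped at eps_end."""
    eps, episodes = eps_start, 0
    while eps > eps_end:
        eps = max(eps_end, eps_decay * eps)  # same update rule as the training loop
        episodes += 1
    return episodes

# defaults mirror EPS_START, EPS_END and EPS_DECAY above
print(episodes_until_floor())
```

With the values above, epsilon reaches its floor only after roughly 920 episodes, close to the end of the `N_EPISODES` budget, so a faster decay is one knob that may be worth trying.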

#### b) describe the blueprint of the deep neural network that will learn how to guide the agent

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# models the deep learning network that is used as a function approximator for choosing the action
# experiment with number of layers and layer size to optimize convergence
# also experiment with externalizing the network shape to the caller layer -> later

class Network(nn.Module):

    def __init__(self, state_size, action_size, seed):
        super(Network, self).__init__()
        self.seed = torch.manual_seed(seed)
        self.fullyconnected1 = nn.Linear(state_size, 128)
        self.fullyconnected2 = nn.Linear(128, 64)
        self.fullyconnected3 = nn.Linear(64, 32)
        self.fullyconnected4 = nn.Linear(32, action_size)

    def forward(self, state):
        x = F.relu(self.fullyconnected1(state))
        x = F.relu(self.fullyconnected2(x))
        x = F.relu(self.fullyconnected3(x))
        return self.fullyconnected4(x)
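
As a quick sanity check on the size of this architecture, the following sketch counts weights and biases for the stack of fully connected layers in plain Python. It assumes the `37`-dimensional state and `4` actions described in section 2; the helper `count_linear_params` is a hypothetical name, not part of PyTorch.

```python
def count_linear_params(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# input size, the three hidden widths used in Network, and the output size
sizes = [37, 128, 64, 32, 4]
print(count_linear_params(sizes))
```

This gives a little over fifteen thousand trainable parameters, which is small enough to train comfortably on a CPU.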
    

#### d) Definition of the replay memory buffer

In [None]:
class ReplayMemory(object):

    def __init__(self, action_size, buffer_size, batch_size, seed):
        self.action_size = action_size
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.experience = namedtuple("Exp",field_names=["state", "action", "reward", "next_state", "done"])
        random.seed(seed)  # random.seed() returns None, so store the seed value itself
        self.seed = seed

    def add(self, state, action, reward, next_state, done):
        e = self.experience(state, action, reward, next_state, done)
        self.memory.append(e)

    def sample(self):
        experiences = random.sample(self.memory, k=self.batch_size)
        # build the output tuple, make sure no null values
        states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to("cpu")
        actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).long().to("cpu")
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to("cpu")
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to("cpu")
        dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to("cpu")
        return (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.memory)
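
The two behaviours the buffer relies on, eviction of the oldest experience once capacity is reached and uniform random sampling, can be seen in isolation with the standard library alone. This is a minimal sketch with a deliberately tiny capacity; the `Transition` name and the dummy values are illustrative assumptions.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

buffer = deque(maxlen=3)  # tiny capacity to make eviction visible
for step in range(5):
    buffer.append(Transition(step, 0, 0.0, step + 1, False))

# only the three most recent transitions survive; states 0 and 1 were evicted
oldest_state = buffer[0].state

random.seed(0)
batch = random.sample(list(buffer), k=2)  # uniform sampling without replacement
print(oldest_state, [t.state for t in batch])
```

Sampling uniformly from a large buffer breaks the correlation between consecutive steps, which is the main reason for storing experiences rather than learning from them in order.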


#### e) Definition of the agent itself

In [None]:
class DQAgent():

    def __init__(self, state_size, action_size, seed):
        self.state_size = state_size
        self.action_size = action_size
        random.seed(seed)  # random.seed() returns None, so store the seed value itself
        self.seed = seed
        # replace "cpu" with a GPU device (e.g. "cuda") if the system permits
        self.qnet = Network(state_size, action_size, seed).to("cpu")
        self.target_qnet = Network(state_size, action_size, seed).to("cpu")
        self.optimizer = optim.Adam(self.qnet.parameters(), lr=LR)

        self.memory = ReplayMemory(action_size, BUFFER_SIZE, BATCH_SIZE, seed)
        self.time_step = 0

    def step(self, state, action, reward, next_state, done):
        self.memory.add(state, action, reward, next_state, done) # append the experience to memory
        self.time_step = self.time_step + 1
        if (self.time_step % UPDATE_FREQUENCY) == 0: # reached an update point, time to learn
            self.time_step = 0
            if len(self.memory) > BATCH_SIZE: # train
                experiences = self.memory.sample() # get a random sample from memory for training
                self.learn(experiences, G)
                
    # get the next chosen action based on the current training
    def act(self, state, eps=0.):
        state = torch.from_numpy(state).float().unsqueeze(0).to("cpu")
        self.qnet.eval()
        with torch.no_grad():
            action_values = self.qnet(state)
        self.qnet.train()
        if random.random() > eps:
            return np.argmax(action_values.cpu().numpy()) # exploitation
        else:
            return random.choice(np.arange(self.action_size)) # exploration

    def learn(self, experiences, gamma):
        states, actions, rewards, next_states, dones = experiences
        max_Q = self.target_qnet(next_states).detach().max(1)[0].unsqueeze(1)
        Q_target = rewards + (gamma * max_Q * (1 - dones))
        Q_expected = self.qnet(states).gather(1, actions)
        loss = F.mse_loss(Q_expected, Q_target) # calculate loss
        self.optimizer.zero_grad()  # reset gradients and train
        loss.backward() # backward prop
        self.optimizer.step()
        self.update(self.qnet, self.target_qnet, T)

    def update(self, local_model, target_model, tau):
        params = zip(target_model.parameters(), local_model.parameters())
        for target_param, local_param in params:
            tensor_aux = tau*local_param.data + (1.0-tau)*target_param.data
            target_param.data.copy_(tensor_aux)
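
The three rules the agent combines (epsilon-greedy action selection, the one-step TD target, and the soft update of the target network) can be written out on plain Python numbers. The sketch below is a scalar illustration under that simplification, not the tensor code used by `DQAgent`; the function names are hypothetical.

```python
import random

def epsilon_greedy(q_values, eps, rng=random):
    """Pick the greedy action with probability about 1 - eps, else a random one."""
    if rng.random() > eps:
        return max(range(len(q_values)), key=q_values.__getitem__)  # exploitation
    return rng.randrange(len(q_values))                             # exploration

def td_target(reward, next_q_values, done, gamma=0.99):
    """r + gamma * max_a' Q_target(s', a'), with no bootstrapping on terminal steps."""
    return reward + gamma * max(next_q_values) * (1.0 - float(done))

def soft_update(local_params, target_params, tau=2e-3):
    """target <- tau * local + (1 - tau) * target, element-wise."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]

print(td_target(1.0, [0.5, 2.0], done=False))
print(soft_update([1.0], [0.0]))
```

With a small `tau`, the target network trails the local network slowly, which keeps the regression target in `learn` from shifting too quickly between updates.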


#### f) Finally, we can start the training

In [None]:
#START THE TRAINING
# reset the environment, set training mode
env_info = env.reset(train_mode=True)[brain_name]
# initialize values
action_size = brain.vector_action_space_size
state_size = len(state)

##### - set initial values

In [None]:
scores = []                                # scores from each episode
last_scores = deque(maxlen=WINDOW_LENGTH)  # for saving last x scores
eps = EPS_START                            # epsilon greedy              

##### - get an agent running

In [None]:
#instantiate the agent
agent = DQAgent(state_size=state_size, action_size=action_size, seed=0)

##### - main training loop

In [None]:
#start training loop
for episode in range(1, N_EPISODES+1):
    env_info = env.reset(train_mode=True)[brain_name] 
    state = env_info.vector_observations[0]            
    score = 0                                          
    for t in range(MAX_STEPS):
        action = agent.act(state, eps)                 # get next proposed action from the agent
        env_info = env.step(action)[brain_name]        # execute action
        next_state = env_info.vector_observations[0]   # get env status
        reward = env_info.rewards[0]
        score += reward
        done = env_info.local_done[0]                  
        agent.step(state, action, reward, next_state, done) # propagate environment response to the DQN agent
        state = next_state                             
        if done:                                       # exit loop if episode finished
            break
    last_scores.append(score)       # save most recent score
    scores.append(score)              
    eps = max(EPS_END, EPS_DECAY*eps) 
    print('\rEpisode {} - average score: {:.2f}'.format(episode, np.mean(last_scores)), end="")
    if np.mean(last_scores)>=13.0:
        print('\nSuccess!! Environment solved in {:d} episodes, average score: {:.2f}'.format(episode, 
                                                                                              np.mean(last_scores)))
        torch.save(agent.qnet.state_dict(), 'checkpoint.pth')
        break
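
The stopping rule in the loop above is a rolling average over the most recent `WINDOW_LENGTH` scores. The sketch below isolates that check so it can be run against any recorded list of scores. Unlike the loop above, this variant also requires a full window before declaring success; that stricter condition is an assumption added here, and `first_solved_episode` is a hypothetical helper name.

```python
from collections import deque

def first_solved_episode(scores, window=100, threshold=13.0):
    """Return the 1-based episode at which the mean of the last `window`
    scores first reaches `threshold` (full window required), or None."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= threshold:
            return episode
    return None

# toy scores with a window of 2, just to show the mechanics
print(first_solved_episode([0, 0, 14, 14], window=2, threshold=13.0))
```

After training, calling it on the `scores` list with the default arguments gives the episode at which the full-window criterion was first met, if it was met at all.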

##### - plot the agent evolution (scores)

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
# plot the training session
s = np.array(scores)
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(len(s)), s)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()

In [None]:
env.close() # close the environment