# Navigation

---

You are welcome to use this coding environment to train your agent for the project.  Follow the instructions below to get started!

### 1. Start the Environment

Run the next code cell to install a few packages.  This line will take a few minutes to run!

In [2]:
# !pip -q install ./setup

The environment is already saved in the Workspace and can be accessed at the file path provided below.  Please run the next code cell without making any changes.

In [1]:
from setup import unityagents
from unityagents import UnityEnvironment
import numpy as np

# please do not modify the line below
env = UnityEnvironment(file_name="setup/Banana.app")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: BananaBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 37
        Number of stacked Vector Observation: 1
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [2]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

Run the code cell below to print some information about the environment.

In [3]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents in the environment
print('Number of agents:', len(env_info.agents))

# number of actions
action_size = brain.vector_action_space_size
print('Number of actions:', action_size)

# examine the state space 
state = env_info.vector_observations[0]
print('States look like:', state)
state_size = len(state)
print('States have length:', state_size)

Number of agents: 1
Number of actions: 4
States look like: [1.         0.         0.         0.         0.84408134 0.
 0.         1.         0.         0.0748472  0.         1.
 0.         0.         0.25755    1.         0.         0.
 0.         0.74177343 0.         1.         0.         0.
 0.25854847 0.         0.         1.         0.         0.09355672
 0.         1.         0.         0.         0.31969345 0.
 0.        ]
States have length: 37


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Note that **in this coding environment, you will not be able to watch the agent while it is training**, and you should set `train_mode=True` to restart the environment.

In [5]:
"""
env_info = env.reset(train_mode=True)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
while True:
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    state = next_state                             # roll over the state to next time step
    if done:                                       # exit loop if episode finished
        break
    
print("Score: {}".format(score))
"""


When finished, you can close the environment.

In [6]:
# env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  A few **important notes**:
- When training the environment, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```
- To structure your work, you're welcome to work directly in this Jupyter notebook, or you might like to start over with a new file!  You can see the list of files in the workspace by clicking on **_Jupyter_** in the top left corner of the notebook.
- In this coding environment, you will not be able to watch the agent while it is training.  However, **_after training the agent_**, you can download the saved model weights to watch the agent on your own machine! 

## Project

### 0. Dependencies

Not trying to be funny with the index zero, I just forgot to add the dependencies and don't want to change the other numbers.

In [11]:
import sys

import numpy as np
from unityagents import UnityEnvironment

# !pip install numpy_ringbuffer
from numpy_ringbuffer import RingBuffer

import torch
import torch.nn.functional as F
import torch.optim as optim
from torch import nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

### 1. Replay buffer

We'll use an abstract class `ReplayBuffer` to declare the required methods, with two concrete implementations:

- `UniformReplayBuffer`
- `PrioritizedReplayBuffer`

We'll provide one of them to the agent.

In [8]:
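class ReplayBuffer():
    """Abstract class for a replay buffer. Concrete implementations are UniformReplayBuffer and PrioritizedReplayBuffer."""
    
    def __init__(self, size):
        # stored as `capacity` so the attribute does not shadow the size() method
        self.capacity = size
        
    def sample(self, n):
        """Returns tuple of arrays ([S_t], [A_t], [R_t], [S_t+1], [done_t+1], [priority])"""
        raise NotImplementedError
    
    def add(self, observation):
        """Add tuple of (S_t, A_t, R_t, S_t+1, done_t+1, priority)"""
        raise NotImplementedError
    
    def size(self):
        """How many tuples are currently stored"""
        raise NotImplementedError
    
    def max_priority(self):
        """Largest priority currently stored"""
        raise NotImplementedError
    
    def update_priority(self, observations):
        """Update the priorities of the given tuples"""
        raise NotImplementedError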

In [5]:
class UniformReplayBuffer():
    
    def __init__(self, size):
        self.buff = RingBuffer(capacity=size, dtype=object)  # could be more specific
    
    def sample(self, n):
        samples = np.random.choice(np.array(self.buff), n)
        
        s = torch.FloatTensor([sample[0] for sample in samples])
        a = torch.LongTensor([sample[1] for sample in samples])
        r = torch.FloatTensor([sample[2] for sample in samples])
        ns = torch.FloatTensor([sample[3] for sample in samples])
        d = torch.FloatTensor([sample[4] for sample in samples])
        p = torch.ones(d.size()) / n
        
        return s, a, r, ns, d, p
    
    def add(self, observation):
        s, a, r, ns, d, _ = observation
        p = 1  # uniform priorities
        self.buff.append((s, a, r, ns, d, p))
    
    def size(self):
        return len(self.buff)
    
    def max_priority(self):
        return 0
    
    def update_priority(self, observations):
        pass

rb = UniformReplayBuffer(4)
rb.add((np.array([1, 2, 3]), 2, 0.5, np.array([1, 2, 4]), 0, 0.4))
rb.add((np.array([2, 3, 4]), 1, 0.5, np.array([1, 2, 4]), 0, 0.1))
rb.add((np.array([3, 4, 5]), 0, 0.5, np.array([1, 2, 4]), 0, 1))
rb.add((np.array([4, 5, 6]), 3, 0.5, np.array([1, 2, 4]), 1, 2))
rb.add((np.array([5, 6, 7]), 2, 0.5, np.array([1, 2, 4]), 1, 4))
s, a, r, ns, d, w = rb.sample(4)
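
The five `add` calls above go into a buffer of capacity 4, so the oldest transition should be overwritten. The sketch below illustrates that behaviour with `collections.deque(maxlen=...)` standing in for `RingBuffer`. The stand-in and the simplified tuples are assumptions made for illustration, not the `numpy_ringbuffer` API.

```python
import random
from collections import deque

# Stand-in for RingBuffer(capacity=4): a deque with maxlen drops the oldest item on overflow
buff = deque(maxlen=4)

# Simplified transitions: (state_id, action, reward, next_state_id, done, priority)
for state_id in range(1, 6):
    buff.append((state_id, 0, 0.5, state_id + 1, 0, 1))

stored_ids = [t[0] for t in buff]
print(stored_ids)  # the first transition (state_id 1) is no longer stored

# Uniform sampling with replacement, mirroring the default of np.random.choice
batch = random.choices(list(buff), k=4)
print(len(batch))
```

Because sampling is with replacement, the same transition can appear more than once in a batch.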

In [10]:
class PrioritizedReplayBuffer():
    pass
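
`PrioritizedReplayBuffer` is left as a stub above. As a rough idea of what proportional prioritization could look like, here is a minimal list-based sketch: transition $i$ is sampled with probability proportional to $p_i^{\text{priority\_exp}}$, and each sample gets an importance weight $(N \cdot P(i))^{-\text{weight\_exp}}$, normalised by the largest weight in the batch. The class name `SimplePrioritizedBuffer`, the exponent names and defaults, and the index-based `update_priority` are all hypothetical and differ from the `ReplayBuffer` interface. Priorities are assumed to be strictly positive.

```python
import random


class SimplePrioritizedBuffer:
    """Hypothetical sketch of proportional prioritized replay (not the notebook's class)."""

    def __init__(self, capacity, priority_exp=0.6, weight_exp=0.4):
        self.capacity = capacity
        self.priority_exp = priority_exp  # 0 means uniform sampling
        self.weight_exp = weight_exp      # strength of the importance-sampling correction
        self.items = []                   # stored transitions
        self.priorities = []              # one priority per transition
        self.next_idx = 0                 # slot to overwrite once the buffer is full

    def add(self, transition, priority):
        if len(self.items) < self.capacity:
            self.items.append(transition)
            self.priorities.append(priority)
        else:
            self.items[self.next_idx] = transition
            self.priorities[self.next_idx] = priority
        self.next_idx = (self.next_idx + 1) % self.capacity

    def max_priority(self):
        # New transitions get the current maximum (1.0 when the buffer is empty)
        return max(self.priorities, default=1.0)

    def sample(self, n):
        scaled = [p ** self.priority_exp for p in self.priorities]
        total = sum(scaled)
        probs = [s / total for s in scaled]
        idxs = random.choices(range(len(self.items)), weights=probs, k=n)
        # Importance-sampling weights, normalised so the largest one is 1
        size = len(self.items)
        weights = [(size * probs[i]) ** (-self.weight_exp) for i in idxs]
        top = max(weights)
        weights = [w / top for w in weights]
        return idxs, [self.items[i] for i in idxs], weights

    def update_priority(self, idxs, td_errors, eps=1e-5):
        # The small eps keeps every priority strictly positive
        for i, err in zip(idxs, td_errors):
            self.priorities[i] = abs(err) + eps
```

Returning the sampled indices is a design choice of this sketch: it lets `update_priority` find the right slots without comparing transitions.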

### 2. Model

Once again we have two choices:

- `DQN`, the original Deep Q-Network
- `DuelingDQN`, the Dueling Deep Q-Network

Both models receive the current state as input and provide an estimate for the corresponding $Q$ values as outputs. Dueling DQN has two streams of independent layers, $V$ and $A$, respectively the value of a state (single value) and the advantage of taking an action (vector of the same size as $Q$), related by $A(s, a) = Q(s, a) - V(s)$. In comparison, DQN has only one stream that directly outputs $Q$.

However, in the dueling architecture we can't simply combine the streams as $Q(s, a) = V(s) + A(s, a)$: adding a constant to $V$ and subtracting it from $A$ yields the same $Q$, so we won't be able to distinguish between them. We can, instead, write

$$Q(s, a) = V(s) + A(s, a) - \max_{a'} A(s, a')$$

so that when the best action $a^*$ is selected, we obtain $Q(s, a^*) = V(s)$, something that is not guaranteed otherwise. To further improve this, we can replace the $\max$ operator with the mean over the available actions. This is the final equation used in the following model, where $|\mathbb{A}(s)| = 4$.

$$Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathbb{A}(s)|} \sum_{a' \in \mathbb{A}(s)} A(s, a')$$
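
To make the two aggregation rules concrete, here is a small numeric sketch in plain Python. The values of `v` and `adv` are made up, and the helper names `aggregate_max` and `aggregate_mean` are hypothetical.

```python
def aggregate_max(v, advantages):
    """Q(s, a) = V(s) + A(s, a) - max_a' A(s, a')."""
    top = max(advantages)
    return [v + a - top for a in advantages]


def aggregate_mean(v, advantages):
    """Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')."""
    mean = sum(advantages) / len(advantages)
    return [v + a - mean for a in advantages]


v = 2.0                        # made-up state value
adv = [0.5, -1.0, 1.5, 0.0]    # made-up advantages for the 4 actions

q_max = aggregate_max(v, adv)
q_mean = aggregate_mean(v, adv)

print(q_max)    # the best action's Q equals V(s)
print(q_mean)   # the Q values average to V(s)

# A constant shift of all advantages is cancelled by the subtraction,
# so the resulting Q values do not change.
shifted = [a + 10.0 for a in adv]
print(aggregate_mean(v, shifted))
```

With the max version, $V(s)$ is the value of the greedy action. With the mean version, $V(s)$ is the average of the $Q$ values.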

In [6]:
class DQN(nn.Module):
    
    def __init__(self, state_size, action_size, hidden_layers=[64, 128, 64]):
        super(DQN, self).__init__()
        
        self.hidden_layers = nn.ModuleList([nn.Linear(state_size, hidden_layers[0])])
        
        A = hidden_layers[:-1]
        B = hidden_layers[1:]
        self.hidden_layers.extend([nn.Linear(a, b) for a, b in zip(A, B)])
        
        self.output_layer = nn.Linear(hidden_layers[-1], action_size)

    def forward(self, state):
        for layer in self.hidden_layers:
            state = layer(state)
            state = F.relu(state)
        state = self.output_layer(state)
        return state

In [12]:
class DuelingDQN(nn.Module):
    pass
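
`DuelingDQN` is left as a stub above. One possible shape for it, following the mean-subtraction equation, is sketched below. The class name `DuelingDQNSketch`, the single shared layer before the two heads, and the hidden size are assumptions of this sketch, not the notebook's final model.

```python
import torch
import torch.nn.functional as F
from torch import nn


class DuelingDQNSketch(nn.Module):
    """Hypothetical dueling network: one shared layer, then separate V and A heads."""

    def __init__(self, state_size, action_size, hidden_size=64):
        super().__init__()
        self.trunk = nn.Linear(state_size, hidden_size)
        self.value_head = nn.Linear(hidden_size, 1)                # V(s)
        self.advantage_head = nn.Linear(hidden_size, action_size)  # A(s, a)

    def forward(self, state):
        x = F.relu(self.trunk(state))
        v = self.value_head(x)          # [batch, 1]
        a = self.advantage_head(x)      # [batch, actions]
        # Q = V + A - mean(A); V broadcasts across the action dimension
        return v + a - a.mean(dim=1, keepdim=True)


net = DuelingDQNSketch(state_size=37, action_size=4)
q = net(torch.zeros(5, 37))
print(q.shape)
```

Since the constructor takes `state_size` and `action_size` first, a class like this could be passed to the agent in place of `DQN`.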

### 3. Error estimation

The original paper for DQN uses two networks, one `local` and one as a fixed `target`, which changes less frequently. The Temporal-Difference error $\delta_t$ is estimated as

$$\delta_t(s, a, r, s') = r + \gamma \max_{a'} Q_{target}(s', a') - Q_{local}(s, a)$$

However, this can be improved using the Double DQN estimation. Rewriting the $\max$ value using $\arg \max$, we can see that we could choose the action using the `local` network and evaluate it with the `target` network. This should reduce the overestimation that the $\max$ operator introduces, as both networks have to "agree" on the outcome of actions.

$$\begin{aligned}
\max_{a'} Q_{target}(s', a') = & \ Q_{target}(s', \arg \max_{a'} Q_{target}(s', a'))\\
\approx & \ Q_{target}(s', \arg \max_{a'} Q_{local}(s', a'))
\end{aligned}
$$

The final equation is:

$$\delta_t(s, a, r, s') = r + \gamma Q_{target}(s', \arg \max_{a'} Q_{local}(s', a')) - Q_{local}(s, a)$$

The agent will rely on one of two functions, `dt_dqn` or `dt_double_dqn`. Both take an extra boolean parameter `d` that signals when the episode is over. In that case, we should only consider the immediate reward, so the TD error simplifies to $\delta_t = r - Q_{local}(s, a)$.
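
The effect of the done flag can be checked on a single transition with plain Python. The numbers below are made up, and `td_error_dqn` is a hypothetical scalar helper, not one of the batched functions used by the agent.

```python
def td_error_dqn(r, gamma, q_target_next, q_local_sa, done):
    """TD error for one transition.

    q_target_next: list of Q_target(s', a') for every action a'
    q_local_sa:    Q_local(s, a) for the action actually taken
    done:          1 if the episode ended on this step, else 0
    """
    bootstrap = gamma * max(q_target_next) * (1 - done)
    return r + bootstrap - q_local_sa


# Made-up numbers, for illustration only
q_next = [0.2, 1.0, -0.3, 0.4]
running = td_error_dqn(r=1.0, gamma=0.99, q_target_next=q_next, q_local_sa=0.5, done=0)
terminal = td_error_dqn(r=1.0, gamma=0.99, q_target_next=q_next, q_local_sa=0.5, done=1)
print(running)    # includes the discounted bootstrap term
print(terminal)   # reduces to r - Q_local(s, a)
```

Multiplying the bootstrap term by `(1 - done)` avoids a branch, which is convenient once the same expression is applied to a whole batch.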

In [7]:
def dt_dqn(s, a, r, ns, d, q_local, q_target, gamma):
    """
    QL [5 x 4]
               a_0      a_1       a_2     a_3
    tensor([[ 0.0060, -0.0931,  0.0496, -0.0626],    # 0 in the batch
            [ 0.0139, -0.0952,  0.0470, -0.0571],    # 1 
            [ 0.0024, -0.0916,  0.0288, -0.0674],    # 2
            [-0.0207, -0.1126,  0.0453, -0.1079],    # 3
            [ 0.0089, -0.1415,  0.0422, -0.1021]])   # 4
    
    a [5] 
    tensor([1, 3, 1, 0, 2])
    
    a.unsqueeze(1) [5 x 1]
    tensor([[ 1],
            [ 3],
            [ 1],
            [ 0],
            [ 2]])
    
    QL.gather(1, a.unsqueeze(1)) [5 x 1]
    tensor(1.00000e-02 *
           [[-9.3088],
            [-5.7069],
            [-9.1603],
            [-2.0720],
            [ 4.2235]])
    
    QL.squeeze [5]
    tensor([-0.0931, -0.0571, -0.0916, -0.0207, 0.0422])
    """
    
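    with torch.no_grad():
        # greedy value of the next state according to the target network
        QT = q_target(ns).max(1)[0]
    # Q value of the action actually taken, according to the local network
    QL = q_local(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # d is already a float tensor (1 where the episode ended), so it masks the bootstrap term
    return r + gamma * QT * (1 - d) - QL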

In [14]:
def dt_double_dqn(s, a, r, ns, d, q_local, q_target, gamma):
    pass
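
`dt_double_dqn` is left as a stub above. A possible batched version, following the final Double DQN equation, is sketched here. The name `dt_double_dqn_sketch` and the tiny linear "networks" in the usage part are assumptions for illustration. The argument order mirrors `dt_dqn`, and `d` is assumed to be a float tensor.

```python
import torch


def dt_double_dqn_sketch(s, a, r, ns, d, q_local, q_target, gamma):
    """Sketch of the Double DQN TD error, with the same argument order as dt_dqn."""
    with torch.no_grad():
        # Select the next action with the local network ...
        best_actions = q_local(ns).argmax(dim=1, keepdim=True)   # [batch, 1]
        # ... and evaluate that action with the target network
        QT = q_target(ns).gather(1, best_actions).squeeze(1)     # [batch]
    QL = q_local(s).gather(1, a.unsqueeze(1)).squeeze(1)         # [batch]
    return r + gamma * QT * (1 - d) - QL


# Usage with made-up inputs: 5 transitions, 3 state features, 4 actions
torch.manual_seed(0)
q_local = torch.nn.Linear(3, 4)
q_target = torch.nn.Linear(3, 4)
s, ns = torch.randn(5, 3), torch.randn(5, 3)
a = torch.tensor([1, 3, 1, 0, 2])
r = torch.zeros(5)
d = torch.tensor([0., 0., 0., 0., 1.])   # last transition ends the episode
delta = dt_double_dqn_sketch(s, a, r, ns, d, q_local, q_target, gamma=0.99)
print(delta.shape)
```

Only the action selection differs from `dt_dqn`. The evaluation of that action still comes from the target network.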

### 4. Agent

Let's put the pieces together.

In [8]:
class QNetworkAgent():
    
    def __init__(self, QNetwork, state_size, action_size, 
                 replay_buffer, Delta, 
                 eps=1, eps_decay=0.9995, min_eps=0.0001, gamma=0.99, 
                 alpha=0.001, tau=0.01,
                 update_every=4, batch_size=64, learning=True):
        self.state_size = state_size
        self.action_size = action_size
        self.q_local = QNetwork(state_size, action_size)
        self.q_target = QNetwork(state_size, action_size)
        self.q_target.eval()
        self.replay_buffer = replay_buffer
        self.Delta = Delta
        
        self.optimizer = optim.Adam(self.q_local.parameters(), lr=alpha)
        
        self.eps = eps
        self.eps_decay = eps_decay
        self.min_eps = min_eps
        
        self.gamma = gamma
        
        self.alpha = alpha
        self.tau = tau
        
        self.update_every = update_every
        self.update_i = 0
        
        self.learning = learning
        
        self.batch_size = batch_size

    def set_learning(self, learning=True):
        self.learning = learning
        
    def act(self, s):
        if not self.learning or np.random.uniform() > self.eps:
            with torch.no_grad():
                s = torch.FloatTensor(s).unsqueeze(0)
                return int(self.q_local(s).max(1)[1])
        else:
            return np.random.randint(self.action_size)
    
    def store(self, s, a, r, ns, d):
        p = self.replay_buffer.max_priority()
        self.replay_buffer.add((s, a, r, ns, d, p))
        if self.update_i == 0 and self.replay_buffer.size() >= self.batch_size:
            self.learn()
        self.update_i = (self.update_i + 1) % self.update_every
        self.eps = max(self.eps * self.eps_decay, self.min_eps)
    
    def learn(self):
        s, a, r, ns, d, w = self.replay_buffer.sample(self.batch_size)
        td_delta = self.Delta(s, a, r, ns, d, self.q_local, self.q_target, self.gamma)
        self.replay_buffer.update_priority(zip(s, a, r, ns, d, torch.abs(td_delta)))
                
        self.optimizer.zero_grad()
        loss = torch.sum(w * (td_delta ** 2))  # weighted mse
        loss.backward()
        self.optimizer.step()
        
        with torch.no_grad():
            for local, target in zip(self.q_local.parameters(), self.q_target.parameters()):
                target.copy_(target + self.tau * (local - target))
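
The last loop in `learn` is a soft update: each target parameter moves a fraction `tau` of the way towards its local counterpart. A plain-Python sketch with made-up weights shows how slowly the target follows. `soft_update` is a hypothetical helper, and the local weights are held fixed here, which is not the case during training.

```python
def soft_update(target, local, tau):
    """Move each target parameter a fraction tau of the way towards the local one."""
    return [t + tau * (l - t) for t, l in zip(target, local)]


local_params = [1.0, -2.0]    # made-up "local network" weights, held fixed
target_params = [0.0, 0.0]    # made-up "target network" weights

for step in range(100):
    target_params = soft_update(target_params, local_params, tau=0.01)

# After n updates the remaining gap is (1 - tau) ** n of the initial gap
remaining = (1 - 0.01) ** 100
print(target_params)
print(remaining)
```

With `tau=0.01`, more than a third of the initial gap is still left after 100 updates, which is what keeps the target comparatively stable.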
        
        

In [9]:
agent = QNetworkAgent(
    DQN,
    state_size, action_size,
    UniformReplayBuffer(100_000),
    dt_dqn
)
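
The agent above is created with the default exploration settings (`eps=1`, `eps_decay=0.9995`, `min_eps=0.0001`), and `store` decays epsilon once per environment step. The sketch below replays that update rule to get a feel for the schedule. The helper names `epsilon_after` and `steps_until` are hypothetical.

```python
def epsilon_after(steps, eps=1.0, eps_decay=0.9995, min_eps=0.0001):
    """Epsilon after `steps` calls to store(), using the agent's update rule."""
    for _ in range(steps):
        eps = max(eps * eps_decay, min_eps)
    return eps


def steps_until(threshold, eps=1.0, eps_decay=0.9995, min_eps=0.0001):
    """Number of store() calls before epsilon first drops below `threshold`."""
    if threshold <= min_eps:
        raise ValueError("epsilon never drops below min_eps")
    steps = 0
    while eps >= threshold:
        eps = max(eps * eps_decay, min_eps)
        steps += 1
    return steps


print(epsilon_after(1000))   # still mostly exploring after 1000 steps
print(steps_until(0.1))      # steps needed before fewer than 10% of actions are random
```

Because the decay is applied per step rather than per episode, the schedule depends on how long episodes last.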

In [12]:
last100 = []
for i in range(1, 101):
    env_info = env.reset(train_mode=True)[brain_name]
    state = env_info.vector_observations[0]
    score = 0
    done = False
    while not done:
        action = agent.act(state)

        env_info = env.step(action)[brain_name]
        next_state = env_info.vector_observations[0]
        reward = env_info.rewards[0]
        done = env_info.local_done[0]

        agent.store(state, action, reward, next_state, done)
        score += reward
        state = next_state

    # if i % 10 == 0:
    
    last100.append(score)
    if len(last100) > 100:
        last100.pop(0)
    
    average = np.mean(last100)
    print("\rEpisode: {}, Score: {}, Average: {}".format(i, score, average), end="")
    sys.stdout.flush()

    if average > 13:
        print("\rSolved in {} episodes. Average: {}".format(i, average), end="")
        sys.stdout.flush()
        break
    
# print("Score: {}".format(score))

Episode: 100, Score: 16.0, Average: 5.59484848484843557
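
The `last100` list in the training loop is a sliding window over the most recent scores. An equivalent, slightly shorter way to keep such a window is `collections.deque` with `maxlen`, sketched here with made-up scores.

```python
from collections import deque

window = deque(maxlen=100)   # keeps only the 100 most recent scores

# Made-up scores, for illustration only
for episode, score in enumerate([0.0, 2.0, 4.0] * 50, start=1):
    window.append(score)     # the oldest score is dropped automatically when full
    average = sum(window) / len(window)

print(len(window), average)
```

This removes the need for the explicit `pop(0)` once the window is full.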

In [19]:
# torch.save(agent.q_local.state_dict(), 'checkpoint.pth')
# agent.q_local.load_state_dict(torch.load('checkpoint.pth', map_location='cpu'))

In [13]:
env_info = env.reset(train_mode=False)[brain_name]
agent.set_learning(False)
state = env_info.vector_observations[0]
score = 0
done = False
while not done:
    action = agent.act(state)

    env_info = env.step(action)[brain_name]
    next_state = env_info.vector_observations[0]
    reward = env_info.rewards[0]
    done = env_info.local_done[0]

    # agent.store(...) is skipped here so that evaluation does not add to the replay buffer or trigger learning
    score += reward
    state = next_state
print(score)

7.0


In [14]:
env.close()