# Problem 1 (Banana Collector) Report

## Content

In this report you will find the following sections:

* Introduction
* Algorithm
* Model architecture
* Training procedure
* Agent performance

## Introduction

This report explains the implementation and training of an RL-based agent that solves the Banana Collector problem. Please see the README.md file for the problem details.

## Algorithm

To solve the problem, the RL agent was trained with the DQN algorithm, using the Experience Replay and Fixed Q-Targets techniques.

### DQN Algorithm

The general expression of the DQN algorithm is:

![DQN Algorithm](images/dqn_algorithm.png)

Please read the [research paper](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf) for the complete details of the algorithm.

It is also important to understand the two key techniques applied in order to improve the algorithm performance.

#### Experience Replay

When the agent interacts with the environment, the sequence of experience tuples can be highly correlated. The naive Q-learning algorithm that learns from each of these experience tuples in sequential order runs the risk of getting swayed by the effects of this correlation. By instead keeping track of a replay buffer and using experience replay to sample from the buffer at random, we can prevent action values from oscillating or diverging catastrophically.

The replay buffer contains a collection of experience tuples $({S}, {A}, {R}, {S'})$. The tuples are gradually added to the buffer as we are interacting with the environment.
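The buffer described above can be sketched as follows. This is a minimal illustration that uses only the standard library; the names `ReplayBuffer` and `Experience` are hypothetical, and the extra `done` flag (marking terminal transitions) is an assumption that is not part of the tuple given above.

```python
import random
from collections import deque, namedtuple

# Hypothetical names; the project's own buffer class may differ.
# `done` is an assumed extra field marking terminal transitions.
Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size buffer that stores experience tuples and samples them uniformly."""

    def __init__(self, buffer_size, batch_size, seed=0):
        # A bounded deque drops the oldest tuples once the buffer is full.
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        """Append one experience tuple as the agent interacts with the environment."""
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Draw a random batch, which breaks the ordering of consecutive steps."""
        return self.rng.sample(list(self.memory), k=self.batch_size)

    def __len__(self):
        return len(self.memory)
```

In a training loop, learning would typically start only once `len(buffer)` is at least `batch_size`, so that a full batch can be drawn.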

#### Fixed Q-Targets

In Q-Learning, we update a guess with a guess, and this can potentially lead to harmful correlations. To avoid this, we can update the parameters $w$ in the network $\hat{q}$ to better approximate the action value corresponding to state $S$ and action $A$ with the following update rule:

![Fixed Q-Targets update rule](images/fixed_q-targets.png)

where ${w^-}$ are the weights of a separate target network that are not changed during the learning step, and $({S}, {A}, {R}, {S'})$ is an experience tuple.
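The role of the target weights can be illustrated with a small sketch in plain Python, so that it runs without PyTorch. The function names `td_target` and `soft_update` are hypothetical, the `done` flag is an added assumption, and treating the `tau` hyperparameter as a soft-update interpolation factor is an assumption based on how DQN implementations commonly use it; the project's own code may differ.

```python
def td_target(reward, next_q_target, gamma, done):
    """Compute R + gamma * max_a q_hat(S', a, w^-).

    `next_q_target` holds the target network's action values for S'.
    No bootstrapping is applied when the transition is terminal.
    """
    return reward + gamma * max(next_q_target) * (1.0 - float(done))


def soft_update(local_weights, target_weights, tau):
    """Move each target weight a fraction `tau` toward the local weight.

    Assumed form: w^- <- tau * w + (1 - tau) * w^-.
    """
    return [
        tau * w + (1.0 - tau) * w_minus
        for w, w_minus in zip(local_weights, target_weights)
    ]


# Example: the target is built from the frozen weights' estimates only.
target = td_target(reward=1.0, next_q_target=[0.5, 2.0, 1.0, 0.0], gamma=0.99, done=False)
new_target_weights = soft_update([1.0, -1.0], [0.0, 0.0], tau=0.001)
```

Because the target value depends only on $w^-$, it stays fixed while $w$ is adjusted in the learning step; the target weights are then refreshed separately.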


## Model Architecture

The network architecture used for the DQN algorithm consists of three fully connected layers:

```python
self.fc1 = nn.Linear(state_size, fc1_units)
self.fc2 = nn.Linear(fc1_units, fc2_units)
self.fc3 = nn.Linear(fc2_units, action_size)
```

where:

state_size = 37  
fc1_units = 64  
fc2_units = 64  
action_size = 4  

The activation function selected in this case is ReLU:

```python
def forward(self, state):
    x = F.relu(self.fc1(state))
    x = F.relu(self.fc2(x))
    return self.fc3(x)
```
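As a quick size check, the number of trainable parameters implied by the layer sizes above can be worked out by hand. The sketch below assumes each `nn.Linear` layer keeps its default bias term; the helper name `linear_params` is hypothetical.

```python
def linear_params(n_in, n_out):
    """Parameters of a fully connected layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


# state_size -> fc1_units -> fc2_units -> action_size
layer_sizes = [37, 64, 64, 4]
counts = [linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)

print(counts)  # [2432, 4160, 260]
print(total)   # 6852
```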





## Training Procedure


In order to train and/or run the Agent, just follow the **Navigation.ipynb** Jupyter Notebook.  

When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:

```python
env_info = env.reset(train_mode=True)[brain_name]
```

The hyperparameters selected for this training are:

n_episodes = 2000 (upper bound only; the agent solves the problem well before reaching it)  
max_t = 1000  
eps_start = 1.0  
eps_end = 0.01  
eps_decay = 0.995  
fc1_units = 64  
fc2_units = 64  
buffer_size = 10000 (replay buffer size)  
batch_size = 64  
gamma = 0.99  
tau = 0.001  
lr = 0.0005  
update_every = 4 (how often to update the network)  
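To show how the three epsilon parameters interact, here is a minimal sketch of the exploration schedule. It assumes epsilon is multiplied by `eps_decay` after every episode and clipped at `eps_end`, which is a common choice but is not confirmed by this report; the function name `epsilon_schedule` is hypothetical.

```python
import math


def epsilon_schedule(n_episodes, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Return the epsilon used in each episode under multiplicative decay with a floor."""
    eps = eps_start
    values = []
    for _ in range(n_episodes):
        values.append(eps)
        # Assumed rule: decay once per episode, never going below eps_end.
        eps = max(eps_end, eps_decay * eps)
    return values


schedule = epsilon_schedule(2000)

# Number of decay steps needed for eps_start * eps_decay**n to fall to eps_end.
episodes_to_floor = math.ceil(math.log(0.01 / 1.0) / math.log(0.995))
```

Under this assumption, epsilon reaches its floor of 0.01 after roughly 919 episodes, so the agent would still be exploring noticeably during the early part of training.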

## Agent Performance

The agent solves the problem in 389 episodes, largely because of the high value of the `max_t` parameter.

![Agent Performance](images/agent_performance.png)
