## Learning Algorithm

The learning algorithm implemented for this assignment is the <a href="https://arxiv.org/pdf/1511.06581.pdf">Dueling Network</a>. The architecture is similar to a Deep Q-Network (DQN) and employs the same fundamental techniques, such as experience replay and fixed Q-targets. The primary difference is in how the $Q$ values are estimated. Rather than estimating $Q(s,a)$ directly, the dueling network splits the final fully connected layer of a DQN into two streams: one estimates the state value $V(s)$ and the other estimates the advantage $A(s,a)$ of taking action $a$ in state $s$. The two streams are then combined to estimate $Q(s,a)$:

<br>

$$Q(s,a) = V(s) + A(s,a)$$

<br>
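The Dueling Network paper linked above points out that a plain sum does not determine $V$ and $A$ uniquely, since a constant can be added to one stream and subtracted from the other without changing $Q$. The aggregation proposed in the paper therefore subtracts the mean advantage over the action set $\mathcal{A}$. Whether this implementation uses that variant is not stated here, so the formula below should be read as the paper's form rather than as a description of this code:

$$Q(s,a) = V(s) + \left(A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a')\right)$$

<br>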

The input is a state vector from the Unity ML-Agents environment, which is fed into a deep neural network with the architecture below:

<br>


| Layer | Input  | Output   |   
|:-------|:--------|:----------|
|FC1    |   37 (state space)  |  64       |   
|FC2    |   64   |  64      |   
|Advantage    |   64     |   4 (action space)    |  
|Value      |  64    |   1      |   
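
To make the layer shapes in the table concrete, here is a minimal, framework-free sketch of a forward pass through such a network. It is illustrative only: the class name `DuelingQNetwork`, the helper functions, the random initialisation, and the ReLU activations after `FC1` and `FC2` are all assumptions, since the activation functions and the actual implementation are not described here. The two streams are combined with the plain sum $Q(s,a) = V(s) + A(s,a)$ given above.

```python
import random

# Layer sizes taken from the architecture table.
STATE_SIZE, HIDDEN_SIZE, ACTION_SIZE = 37, 64, 4


def make_linear(n_in, n_out, rng):
    """Create (weights, biases) for a fully connected layer (hypothetical init)."""
    bound = 1.0 / n_in ** 0.5
    weights = [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def linear(layer, x):
    """Apply a fully connected layer: y = W x + b."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    """Element-wise ReLU (assumed activation, not confirmed by the text)."""
    return [max(0.0, v) for v in x]


class DuelingQNetwork:
    """Shared body (FC1, FC2) followed by separate advantage and value streams."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.fc1 = make_linear(STATE_SIZE, HIDDEN_SIZE, rng)
        self.fc2 = make_linear(HIDDEN_SIZE, HIDDEN_SIZE, rng)
        self.advantage = make_linear(HIDDEN_SIZE, ACTION_SIZE, rng)
        self.value = make_linear(HIDDEN_SIZE, 1, rng)

    def forward(self, state):
        """Return one Q-value per action for a single state vector."""
        h = relu(linear(self.fc1, state))
        h = relu(linear(self.fc2, h))
        adv = linear(self.advantage, h)   # A(s, a), one entry per action
        val = linear(self.value, h)[0]    # V(s), a single scalar
        return [val + a for a in adv]     # Q(s, a) = V(s) + A(s, a)


q_values = DuelingQNetwork(seed=0).forward([0.1] * STATE_SIZE)
greedy_action = max(range(ACTION_SIZE), key=q_values.__getitem__)
```

In practice the same structure would normally be written with a deep learning framework so that the weights can be trained by backpropagation; the pure-Python version only shows how the shared layers feed both streams.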
    



#### Parameters used for training

```python
# Note: no trailing commas, otherwise each constant would become a one-element tuple.
BUFFER_SIZE = int(1e5)  # replay buffer size
BATCH_SIZE = 256        # minibatch size sampled from the replay buffer
GAMMA = 0.99            # discount factor
TAU = 1e-3              # interpolation factor for soft updates of the target parameters
LR = 5e-4               # learning rate
UPDATE_EVERY = 16       # how often (in steps) to update the network
```
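
As a rough illustration of where `GAMMA`, `TAU`, `BATCH_SIZE` and `UPDATE_EVERY` typically enter a DQN-style training loop, the sketch below shows a fixed-Q TD target, a soft update of the target parameters, and a step-based learning schedule. The function names are hypothetical, parameters are represented as flat lists of floats for simplicity, and the rule of waiting until the buffer holds at least one full batch is an assumption rather than something stated above.

```python
GAMMA = 0.99
TAU = 1e-3
BATCH_SIZE = 256
UPDATE_EVERY = 16


def td_target(reward, next_q_values, done, gamma=GAMMA):
    """Fixed-Q target: r + gamma * max_a' Q_target(s', a'), with no bootstrap at episode end."""
    bootstrap = 0.0 if done else max(next_q_values)
    return reward + gamma * bootstrap


def soft_update(local_params, target_params, tau=TAU):
    """Blend parameters: theta_target <- tau * theta_local + (1 - tau) * theta_target."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]


def should_learn(step, buffer_len, batch_size=BATCH_SIZE, update_every=UPDATE_EVERY):
    """Learn every `update_every` steps, once the buffer can supply a full batch (assumed rule)."""
    return step % update_every == 0 and buffer_len >= batch_size
```

With `TAU = 1e-3`, each soft update moves the target parameters only 0.1% of the way towards the local parameters, which keeps the targets slowly changing.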


## Plot of Rewards

Below is a plot of the agent's score during training. The agent reaches an average reward (over 100 episodes) of at least __+13__ after __641 episodes__. In the code, training stops as soon as this score is reached; had training continued, the agent would likely have achieved a higher score.

<img src="img/training.png">
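
The stopping rule described above can be sketched as a trailing-window average over the per-episode scores. This is a minimal illustration, not the project's actual code: the function name `episodes_to_solve` is hypothetical, and it assumes the criterion is only checked once a full window of 100 episodes is available.

```python
from collections import deque

TARGET_SCORE = 13.0  # required average reward
WINDOW = 100         # number of episodes averaged over


def episodes_to_solve(scores, target=TARGET_SCORE, window=WINDOW):
    """Return the 1-based episode at which the trailing-window mean first reaches `target`, else None."""
    recent = deque(maxlen=window)  # keeps only the last `window` scores
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None
```

For example, `episodes_to_solve([13.0] * 100)` returns `100`, while a run that never reaches the target returns `None`.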



## Ideas for Future Work

A few things that could be tried to improve model performance are:

- hyperparameter tuning of:
    - number of hidden units per layer
    - number of hidden layers
    - number of steps between calculating the loss and updating the target network
    - type of optimizer and learning rate
    - batch size
- augmenting the current model with <a href="https://arxiv.org/abs/1511.05952">prioritized experience replay</a> and <a href="https://arxiv.org/abs/1509.06461">double DQN</a>
- training the agent on raw pixels using a __convnet__
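
As an illustration of the double DQN idea mentioned in the list, the sketch below computes a target in which the local network selects the next action and the target network evaluates it. This is not part of the current implementation; the function name and its list-based inputs are hypothetical and chosen only to keep the example self-contained.

```python
def double_dqn_target(reward, next_q_local, next_q_target, done, gamma=0.99):
    """Double DQN target: r + gamma * Q_target(s', argmax_a Q_local(s', a))."""
    # The local (online) network chooses the greedy action in the next state...
    best_action = max(range(len(next_q_local)), key=next_q_local.__getitem__)
    # ...and the target network supplies the value of that action.
    bootstrap = 0.0 if done else next_q_target[best_action]
    return reward + gamma * bootstrap


# The local network prefers action 1, so the target network's estimate for
# action 1 is used, even though its own maximum is at action 2.
target = double_dqn_target(1.0, [0.1, 0.9, 0.3, 0.2], [5.0, 2.0, 7.0, 1.0], done=False)
```

Decoupling action selection from action evaluation in this way is what reduces the overestimation bias of the standard max-based target.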