## Learning Algorithm

The learning algorithm implemented for this assignment is <a href="https://arxiv.org/pdf/1802.09477.pdf">TD3</a>, which addresses the overestimation bias of <a href="https://arxiv.org/abs/1509.02971">DDPG</a>. DDPG is an off-policy, actor-critic algorithm that borrows concepts from Deep Q-Learning, such as _Experience Replay_ and _Fixed Q-Targets_, to train agents in environments with continuous action spaces. The main difference between TD3 and DDPG is the use of two critic networks to estimate Q-values, with the smaller of the two estimates used to form the targets in the Bellman error loss functions. Other differences include:

 - noise added to the target actions that the critics evaluate, which regularizes the Q-value targets
 - delayed updates of the actor network, so that it is updated less often than the critics
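
As a rough illustration of how the two critics and the target-action noise combine into a single Bellman target, here is a minimal pure-Python sketch for one transition. The names `td3_target`, `actor_target`, `q1_target` and `q2_target` are hypothetical and not taken from this repository, and the defaults `noise_std=0.2` and `noise_clip=0.5` are the values suggested in the TD3 paper, which may differ from the ones used here. A real implementation would operate on batched tensors instead of Python lists.

```python
import random


def clip(x, low, high):
    """Clamp x to the closed interval [low, high]."""
    return max(low, min(high, x))


def td3_target(reward, done, next_state, actor_target, q1_target, q2_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0,
               rng=random):
    """Compute the TD3 Bellman target for a single transition.

    actor_target, q1_target and q2_target are hypothetical callables standing
    in for the target actor and the two target critics.
    """
    # 1. Regularize the target action with clipped Gaussian noise
    next_action = []
    for a in actor_target(next_state):
        noise = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
        next_action.append(clip(a + noise, -max_action, max_action))

    # 2. Use the smaller of the two critic estimates
    q_next = min(q1_target(next_state, next_action),
                 q2_target(next_state, next_action))

    # 3. Bellman target; terminal transitions (done == 1) do not bootstrap
    return reward + gamma * (1.0 - done) * q_next
```

Both critics are then regressed toward this same target, which is what keeps a single overoptimistic critic from inflating the value estimates.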
  
For this assignment, I trained an agent using both __DDPG__ and __TD3__; the learning plots are displayed below. DDPG took ~3500 episodes to complete the task (i.e. achieve an average score of +30 over 100 consecutive episodes), and I found it difficult to settle on good hyperparameters. Additionally, as the plots show, DDPG's scores vary widely from episode to episode and its learning curve is not as smooth as that of TD3, which took ~442 episodes to complete the task!
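
The completion criterion (an average score of +30 over 100 consecutive episodes) can be checked with a rolling window over the per-episode scores. The sketch below is illustrative only; the function name `first_solved_episode` and its arguments are hypothetical and not part of this repository.

```python
from collections import deque


def first_solved_episode(scores, target=30.0, window=100):
    """Return the 1-based episode at which the mean of the last `window`
    scores first reaches `target`, or None if that never happens."""
    recent = deque(maxlen=window)  # automatically drops the oldest score
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None
```

For example, `first_solved_episode([30.0] * 100)` returns `100`, since the window must be full before the criterion can be met.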

<br>



DDPG            |  TD3
:-------------------------:|:-------------------------:
<img src="img/DPDG_Continuous_Control.png"> |  <img src="img/TD3_Continuous_Control.png">


<br>



### Architecture and Hyperparameters


The network architecture and hyperparameters for the TD3 model are listed below.

<br>


<center> Actor </center>

| Layer | Input  | Output   |   
|:-------|:--------|:----------|
|FC1    |   33 (state space)  |  400       |   
|FC2    |   400   |  300      |   
|FC3    |   300     |   4 (action space)    |
    
<br>
<br>

<center> Critic </center>

| Layer | Input  | Output   |   
|:-------|:--------|:----------|
|FC1    |   37 = 33 (state space) + 4 (action space)   |  400       |   
|FC2    |   400    |  300      |   
|FC3    |   300     |   1  (Q-value)  |

<br>
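
To give a sense of model size, the layer tables above can be turned into parameter counts. This sketch assumes every fully connected layer has a bias term, that there are no additional layers such as batch normalization, and that the critic receives the 33-dimensional state concatenated with the 4-dimensional action (37 inputs). The helper `dense_param_count` is hypothetical.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


STATE_SIZE, ACTION_SIZE = 33, 4

actor_params = dense_param_count([STATE_SIZE, 400, 300, ACTION_SIZE])
critic_params = dense_param_count([STATE_SIZE + ACTION_SIZE, 400, 300, 1])

print(actor_params)       # 135104
print(critic_params)      # 135801
print(2 * critic_params)  # 271602 across the two critics that TD3 trains
```

The target copies of each network double these totals in memory, but they are updated by soft updates rather than by gradient descent.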

#### Parameters used for training:

```python

BUFFER_SIZE = int(1e6)  # replay buffer size
BATCH_SIZE = 512        # minibatch size
GAMMA = 0.99            # discount factor
TAU = 5e-3              # for soft update of target parameters
LR_ACTOR = 1e-5         # learning rate of the actor
LR_CRITIC = 1e-4        # learning rate of the critic
WEIGHT_DECAY = 0        # L2 weight decay
UPDATE_EVERY = 2        # Steps to update agent

```
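
As a minimal sketch of how two of these values are typically used: `TAU` controls how quickly the target networks track the trained networks, and `UPDATE_EVERY` is assumed here to play the role of TD3's actor update delay. That reading is an assumption; the repository may instead use it as the number of environment steps between learning updates. The functions `soft_update` and `should_update_actor` are hypothetical and work on plain lists of floats instead of network parameters.

```python
TAU = 5e-3        # soft-update rate, as listed above
UPDATE_EVERY = 2  # assumed here to be the actor update delay


def soft_update(local_params, target_params, tau=TAU):
    """Blend target parameters toward local ones:
    target <- tau * local + (1 - tau) * target."""
    return [tau * l + (1.0 - tau) * t
            for l, t in zip(local_params, target_params)]


def should_update_actor(critic_step, every=UPDATE_EVERY):
    """True on every `every`-th critic update (critic_step starts at 1)."""
    return critic_step % every == 0
```

With `TAU = 5e-3`, each soft update closes only 0.5% of the gap between target and local parameters, so the targets change slowly and the critic's regression targets stay relatively stable.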


## Ideas for Future Work

A few things that can be tried to improve model performance are:

1. Tuning of hyperparameters:

    - number of hidden units per layer
    - number of hidden layers
    - actor/critic learning rates
    - batch size
    - noise regularization

2. Trying other algorithms such as D4PG, A2C and PPO to see how they compare.