## Learning Algorithm

<a href="https://arxiv.org/pdf/1706.02275.pdf">MADDPG</a> (Multi-Agent Deep Deterministic Policy Gradient) extends <a href="https://arxiv.org/abs/1509.02971">DDPG</a> by introducing the concept of __Centralized Training__ and __Decentralized Execution__: policies use extra information from each other during training, but do not rely on this information during execution. In essence, as depicted below, the critic is augmented with extra information about the policies of other agents, while the actor only has access to local information.
    

<img src="./img/maddpg.png" width="450" height="450">
<center><a href="https://arxiv.org/pdf/1706.02275.pdf">source</a> </center>
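
The split between what the actor and the critic see can be illustrated with a minimal sketch. It follows the general formulation in the MADDPG paper, where the critic receives every agent's observation and action. The sizes (24-dimensional observations, 2-dimensional actions, two agents) are taken from the architecture tables in this document; the helper names `actor_input` and `critic_input` are hypothetical, and the exact inputs fed to this project's critic may differ from the general form shown here.

```python
# Sketch of "who sees what" in centralized training / decentralized execution.
# Helper names are hypothetical; sizes come from the tables in this document.
OBS_SIZE = 24      # per-agent observation size
ACTION_SIZE = 2    # per-agent action size
NUM_AGENTS = 2


def actor_input(observations, agent_idx):
    """Decentralized execution: an actor sees only its own observation."""
    return list(observations[agent_idx])


def critic_input(observations, actions):
    """Centralized training: the critic sees all observations and all actions."""
    flat_obs = [x for obs in observations for x in obs]
    flat_actions = [a for act in actions for a in act]
    return flat_obs + flat_actions


# Dummy data, only used to show the resulting input sizes.
observations = [[0.0] * OBS_SIZE for _ in range(NUM_AGENTS)]
actions = [[0.0] * ACTION_SIZE for _ in range(NUM_AGENTS)]

print(len(actor_input(observations, 0)))         # 24
print(len(critic_input(observations, actions)))  # 52
```

Because the actor never reads the other agent's observation or action, the trained policy can be run on its own at execution time.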
  



## Neural Network Architecture

The layer sizes of the actor and critic networks used by the agents are listed below; the training hyperparameters follow in the next section:

<br>


<center> Actor </center>

| Layer | Input  | Output   |   
|:-------|:--------|:----------|
|FC1    |   24 (state space)  |  64       |   
|FC2    |   64   |  64      |   
|FC3    |   64     |   2 (action space)    |
    
<br>
<br>

<center> Critic </center>

| Layer | Input  | Output   |   
|:-------|:--------|:----------|
|FC1    |   24 (state space)   |64|   
|FC2    |   64 + 4 (actions of both agents) |64|   
|FC3    |   64     |   1  (Q-value)  |

<br>
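
As a rough sanity check on the tables above, the number of trainable parameters can be worked out from the layer shapes. This is a sketch that assumes every layer is a plain fully connected layer with a bias term (the tables do not state whether biases are used), and that the critic's second layer takes the 64 outputs of FC1 concatenated with 4 action values.

```python
def linear_params(n_in, n_out, bias=True):
    """Parameter count of one fully connected layer (weights plus optional bias)."""
    return n_in * n_out + (n_out if bias else 0)


# (input size, output size) per layer, copied from the tables above.
ACTOR_LAYERS = [(24, 64), (64, 64), (64, 2)]
CRITIC_LAYERS = [(24, 64), (64 + 4, 64), (64, 1)]  # actions enter at FC2

actor_total = sum(linear_params(n_in, n_out) for n_in, n_out in ACTOR_LAYERS)
critic_total = sum(linear_params(n_in, n_out) for n_in, n_out in CRITIC_LAYERS)

print(actor_total)   # 5890
print(critic_total)  # 6081
```

Both networks are small, so the batch size of 512 used for training is cheap to process.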

## Parameters Used for Training

```python

CONFIG = {
    "BUFFER_SIZE": int(1e6),     # replay buffer size
    "BATCH_SIZE": 512,           # minibatch size
    "GAMMA": 0.95,               # discount factor
    "TAU": 1e-2,                 # for soft update of target parameters
    "LR_ACTOR": 1e-3,            # learning rate of the actor
    "LR_CRITIC": 1e-3,           # learning rate of the critic
    "WEIGHT_DECAY": 0,           # L2 weight decay
    "SIGMA": 0.001,              # std of noise
    "CLIP_GRADS": True,          # Whether to clip gradients
    "CLAMP_VALUE": 0.5,          # Clip value
    "FC1": 64,                   # First linear layer size
    "FC2": 64,                   # Second linear layer size
    "WARMUP": 0,                 # number of warmup steps
}

```

## Plot of Rewards

Below is a plot of the agents' score during training. The agents reach an average score of +0.5 (over 100 consecutive episodes, after taking the maximum over both agents) in ~1170 steps. In the code, training stops as soon as this score is reached; had training continued, the score would likely have risen further.

<img src="img/tennis_maddpg.png" width="450" height="450">
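
The stopping criterion described above (maximum over both agents per episode, averaged over the last 100 episodes, compared against +0.5) can be sketched as follows. The function name is hypothetical, the sketch assumes the average is only checked once a full 100-episode window is available, and the scores are made-up toy data rather than results from this project.

```python
from collections import deque

TARGET_SCORE = 0.5
WINDOW = 100


def episodes_to_solve(per_agent_scores, target=TARGET_SCORE, window=WINDOW):
    """Return the 1-based episode at which the rolling mean first reaches target, else None."""
    recent = deque(maxlen=window)          # keeps only the last `window` episode scores
    for episode, scores in enumerate(per_agent_scores, start=1):
        recent.append(max(scores))         # episode score = max over both agents
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None


# Toy data: 80 weak episodes followed by 120 strong ones.
toy_scores = [(0.1, 0.0)] * 80 + [(1.0, 0.4)] * 120
print(episodes_to_solve(toy_scores))  # 125
```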



## Ideas for Future Work

A few things that can be tried to improve model performance are:

- Tuning of hyper-parameters:

    - number of hidden units
    - number of hidden layers
    - actor/critic learning rates
    - batch size
    - noise regularization

- Implement Policy Ensembles, as suggested in the original paper
- Explore the idea of incorporating Prioritized Experience Replay, as suggested in this <a href="https://cardwing.github.io/files/RL_course_report.pdf">paper</a>
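
As a starting point for the last idea, the core of proportional prioritized replay can be sketched in a few lines. This is not part of the current implementation: the function names are hypothetical, the `alpha` and `beta` values are only illustrative, and priorities are assumed to be strictly positive (for example, absolute TD errors plus a small constant).

```python
import random


def sampling_probabilities(priorities, alpha=0.6):
    """P(i) = p_i^alpha / sum_k p_k^alpha; alpha = 0 recovers uniform sampling."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probabilities, beta=0.4):
    """w_i = (N * P(i))^(-beta), normalized so the largest weight is 1."""
    n = len(probabilities)
    weights = [(n * p) ** (-beta) for p in probabilities]
    max_weight = max(weights)
    return [w / max_weight for w in weights]


def sample_indices(probabilities, batch_size, rng=random):
    """Draw transition indices (with replacement) according to their probabilities."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=batch_size)


priorities = [1.0, 2.0, 4.0]                      # e.g. |TD error| + epsilon per transition
probs = sampling_probabilities(priorities, alpha=1.0)
weights = importance_weights(probs)
batch = sample_indices(probs, batch_size=4)
```

Transitions with larger priorities are replayed more often, and the importance weights scale down their contribution to the critic loss to compensate for the non-uniform sampling.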
