# Collaboration and competition project report
## Introduction

In this project the Tennis environment is used to train two agents to cooperate. 
In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1.  If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01.  Thus, the goal of each agent is to keep the ball in play.

The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. Each agent receives its own, local observation.  Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping. 

The task is episodic, and in order to solve the environment, your agents must get an average score of +0.5 (over 100 consecutive episodes, after taking the maximum over both agents). Specifically,

- After each episode, we add up the rewards that each agent received (without discounting), to get a score for each agent. This yields 2 (potentially different) scores. We then take the maximum of these 2 scores.
- This yields a single **score** for each episode.

The environment is considered solved when the average (over 100 episodes) of those **scores** is at least +0.5.
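
The scoring rule above can be sketched in a few lines of Python. This is only an illustration, not the project's code: the function names `episode_score` and `is_solved` are hypothetical, and rewards are assumed to arrive as one `(agent_0, agent_1)` pair per time step.

```python
def episode_score(rewards_per_step):
    """Return the episode score: max over agents of the undiscounted reward sum.

    rewards_per_step: list of (reward_agent_0, reward_agent_1) tuples (assumed layout).
    """
    totals = [0.0, 0.0]
    for r0, r1 in rewards_per_step:
        totals[0] += r0
        totals[1] += r1
    return max(totals)


def is_solved(scores, window=100, target=0.5):
    """True once the mean of the last `window` episode scores reaches `target`."""
    if len(scores) < window:
        return False
    recent = list(scores)[-window:]
    return sum(recent) / window >= target
```

For example, an episode in which one agent collects `0.2` and the other `0.09` gets a score of `0.2`.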

A sample of the untrained agents looks like:

[<img src="images/tennis_untrained.gif">]()




## Approach to the problem

The environment is a multi-agent one in which the agents must cooperate to reach the target score. Each agent implements the DDPG algorithm, and the experiences of both agents are added to a common replay buffer.
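
A minimal sketch of such a common buffer is shown below. It is not taken from `code/utils.py`; the class name `SharedReplayBuffer`, the `Experience` fields, and the method names are assumptions made for illustration. The default sizes match the hyperparameter table in this report.

```python
import random
from collections import deque, namedtuple

# Hypothetical experience layout; the project's buffer may store other fields.
Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class SharedReplayBuffer:
    """Fixed-size buffer that all agents write to and sample from."""

    def __init__(self, buffer_size=100_000, batch_size=256, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest entries are dropped first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniform sampling without replacement from the stored experiences.
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)
```

With this layout, each agent would call `add` with its own transition at every step, and learning would start once `len(buffer)` is at least the batch size.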


### Agent
Each agent consists of an actor (a policy network) and a critic (a Q-network), each implemented with a local and a target network.

The exploration noise is an Ornstein-Uhlenbeck process. Its σ value was tuned during training and proved to be an important parameter for reaching a solution.
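
For reference, a discrete Ornstein-Uhlenbeck process can be sketched as follows. This is not the code in `code/utils.py`: the class name `OUNoise`, the `theta=0.15` default, and the use of Gaussian increments are assumptions. Only `sigma=0.11` comes from the hyperparameter table in this report.

```python
import random


class OUNoise:
    """Ornstein-Uhlenbeck noise: x <- x + theta * (mu - x) + sigma * N(0, 1)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.11, seed=0):
        self.mu = [mu] * size
        self.theta = theta  # assumed value, pulls the state back toward mu
        self.sigma = sigma  # scale of the random perturbation
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Reset the internal state to the mean (typically at episode start)."""
        self.state = list(self.mu)

    def sample(self):
        """Advance the process one step and return the new noise vector."""
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)
```

The sampled vector would be added to the actor's action before clipping. A larger `sigma` gives more exploration but noisier actions.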

### Structure of the solution
The training entry point is `tennis.py`. The structure of the project is the following:

- `tennis.py`: Training implementation
- `code/maddpg.py`: Algorithm implementation for the multiagent environment
- `code/ddpg.py`: Single agent implementation
- `code/model.py`: Model neural network definition of the agents
- `code/utils.py`: Definition of noise and replay buffer required for the agents
- `models/*`: Saved models that achieve the target score
- `tennis_untrained.py`: Shows the untrained agents interacting (helper file to capture video)
- `tennis_trained.py`: Shows the trained agents interacting (helper file to capture video)


### Neural network model
The Actor and Critic models are fully connected neural networks with three hidden layers of `512`, `256` and `128` units respectively. The actor's output layer uses a `tanh` activation, and its outputs are clipped to the range `[-1, +1]`. Other layer sizes can also reach the target, so there is a wide range of values to try.

### Hyperparameters
The following hyperparameters are used:

| Name        | Value |
|-------------|-------|
| BUFFER_SIZE | 1e5   |
| BATCH_SIZE  | 256   |
| GAMMA       | 0.99  |
| TAU         | 1e-3  |
| LR_ACTOR    | 3e-5  |
| LR_CRITIC   | 1e-4  |
| NOISE σ     | 0.11  |
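
To show where `GAMMA` and `TAU` enter, here are the two update rules of the standard DDPG formulation, written on plain floats. This assumes the project follows that formulation; the function names `td_target` and `soft_update` are hypothetical, and real code would apply them to network tensors.

```python
GAMMA = 0.99  # discount factor from the table above
TAU = 1e-3    # soft-update rate from the table above


def td_target(reward, next_q, done, gamma=GAMMA):
    """Critic target: r + gamma * Q_target(s', a') for non-terminal steps."""
    return reward + gamma * next_q * (0.0 if done else 1.0)


def soft_update(local_params, target_params, tau=TAU):
    """Blend local weights into target weights: tau * local + (1 - tau) * target."""
    return [
        tau * local + (1.0 - tau) * target
        for local, target in zip(local_params, target_params)
    ]
```

With `TAU = 1e-3` the target networks move only 0.1% of the way toward the local networks per update, which keeps the critic's targets slowly changing.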



## Training and plot of rewards
Training runs for 3000 episodes. When the target score of `+0.5` is first reached, the agent networks are saved. After that, each time the average score exceeds the best saved result by at least `+0.01`, the saved networks are overwritten. The output of the program is:


``` 
Episode 100     Average Score: 0.00     Score: -0.00
Episode 200     Average Score: 0.00     Score: -0.00
Episode 300     Average Score: 0.00     Score: -0.00
Episode 400     Average Score: 0.00     Score: -0.00
Episode 500     Average Score: 0.01     Score: 0.050
Episode 600     Average Score: 0.00     Score: -0.00
Episode 700     Average Score: 0.00     Score: -0.00
Episode 800     Average Score: 0.01     Score: -0.00
Episode 900     Average Score: 0.00     Score: -0.00
Episode 1000    Average Score: 0.02     Score: -0.00
Episode 1100    Average Score: 0.03     Score: -0.00
Episode 1200    Average Score: 0.02     Score: 0.050
Episode 1300    Average Score: 0.02     Score: -0.00
Episode 1400    Average Score: 0.01     Score: -0.00
Episode 1500    Average Score: 0.04     Score: -0.00
Episode 1600    Average Score: 0.06     Score: -0.00
Episode 1700    Average Score: 0.11     Score: 0.050
Episode 1800    Average Score: 0.42     Score: 0.300
Episode 1819    Average Score: 0.50     Score: 0.800
 Agents saved for score 0.50
Episode 1821    Average Score: 0.52     Score: 0.75
 Agents saved for score 0.51
Episode 1822    Average Score: 0.52     Score: 0.80
 Agents saved for score 0.52
Episode 1823    Average Score: 0.53     Score: 0.80
 Agents saved for score 0.53
Episode 1827    Average Score: 0.54     Score: 0.75
 Agents saved for score 0.54
Episode 1829    Average Score: 0.55     Score: 0.70
 Agents saved for score 0.55
Episode 1845    Average Score: 0.56     Score: 0.800
 Agents saved for score 0.56
Episode 1900    Average Score: 0.54     Score: 0.150
Episode 2000    Average Score: 0.33     Score: 0.250
Episode 2100    Average Score: 0.53     Score: 0.750
Episode 2111    Average Score: 0.57     Score: 0.75
 Agents saved for score 0.57
Episode 2114    Average Score: 0.58     Score: 0.80
 Agents saved for score 0.58
Episode 2119    Average Score: 0.59     Score: 0.60
 Agents saved for score 0.59
Episode 2164    Average Score: 0.60     Score: 0.800
 Agents saved for score 0.60
Episode 2178    Average Score: 0.61     Score: 0.800
 Agents saved for score 0.61
Episode 2200    Average Score: 0.56     Score: 0.800
Episode 2300    Average Score: 0.52     Score: 0.700
Episode 2400    Average Score: 0.57     Score: 0.100
Episode 2500    Average Score: 0.38     Score: 0.500
Episode 2600    Average Score: 0.55     Score: 0.800
Episode 2700    Average Score: 0.52     Score: 0.500
Episode 2800    Average Score: 0.60     Score: 0.600
Episode 2900    Average Score: 0.56     Score: 0.750
Episode 3000    Average Score: 0.56     Score: 0.800
```
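
The saving rule that produces the `Agents saved` lines can be sketched as below. This is an illustration of the rule as described in this section, not the code in `tennis.py`; the names `should_save` and `saved_checkpoints` are hypothetical.

```python
def should_save(avg_score, best_saved, target=0.5, min_gain=0.01):
    """Decide whether to checkpoint, given the best saved average (or None)."""
    if best_saved is None:
        # Nothing saved yet: save as soon as the target is reached.
        return avg_score >= target
    # Afterwards, save only on an improvement of at least `min_gain`.
    return avg_score >= best_saved + min_gain


def saved_checkpoints(averages):
    """Return the average scores at which a checkpoint would be written."""
    best = None
    saved = []
    for avg in averages:
        if should_save(avg, best):
            best = avg
            saved.append(avg)
    return saved
```

For instance, `saved_checkpoints([0.42, 0.50, 0.505, 0.52, 0.48, 0.61])` keeps the checkpoints at `0.50`, `0.52` and `0.61`, so a later dip in performance never overwrites a better saved model.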

The target of `+0.5` is first reached at episode `1819`. The agents then keep improving, up to an average score of `+0.61`, which is the last point at which they are saved.
Graphically, the score evolution is depicted in the following plot, where the blue line shows the score of every episode and the red line is the average over the last `100` episodes.
After initially reaching the target, there are some dips where the average falls below `+0.5`, but the agents recover and stay above the target most of the time.



[<img src="images/scores.png">]()



## Example of trained agents
After training, the agents cooperate quite well.


[<img src="images/tennis_trained.gif">]()


## Ideas for Future Work

The performance of the agents in this multi-agent environment looks satisfactory. Further work could go into selecting the hyperparameters so that the required performance is reached faster. This could also involve choosing a different network architecture.

Moreover, this environment is purely cooperative. As a next step, the algorithm should be tried in a competitive or hybrid environment such as the soccer one, which consists of two teams of two players each.