# DRLND - P3 - Collaboration and Competition : Report
________________________________________________________________________

This report describes the environment and the algorithm I used to solve a collaboration and competition problem in which two agents must bounce a ball back and forth without dropping it or sending it out of bounds.

## Environment

We work with the Unity ML-Agents [Tennis](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Learning-Environment-Examples.md#tennis) environment in this project.

In this environment, two agents control rackets to bounce a ball over a net. An agent receives a reward of +0.1 if it hits the ball over the net. If it lets the ball drop or hits it out of bounds, it receives a reward of -0.01.

The observation space of each agent consists of 24 variables (3 stacked frames of 8 variables each) representing the position and velocity of the ball and racket. Each action is a vector of two numbers, corresponding to movement towards or away from the net, and jumping. Every entry of the action vector should be a number between -1 and 1.
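
Exploration noise can push a raw policy output outside this range, so actions are usually clamped before they are sent to the environment. The following is a minimal sketch of that step. The helper name `clip_action` and the sample values are hypothetical and not taken from the project code.

```python
def clip_action(action, low=-1.0, high=1.0):
    """Clamp every entry of an action vector to the range [low, high]."""
    return [max(low, min(high, a)) for a in action]


# One action per agent: [movement towards/away from the net, jump]
raw_actions = [[1.3, -0.2], [-1.7, 0.5]]   # hypothetical noisy policy outputs
clipped = [clip_action(a) for a in raw_actions]
```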

The environment is considered solved when the agents achieve an average score of at least +0.5 over 100 consecutive episodes.
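
The solve criterion can be tracked with a rolling window of episode scores. The sketch below assumes that the score of an episode is the maximum of the two agents' scores; that convention and the name `make_solve_checker` are assumptions for illustration, not details stated in this report.

```python
from collections import deque


def make_solve_checker(window=100, target=0.5):
    """Return a function that records episode scores and reports progress."""
    scores = deque(maxlen=window)   # keeps only the most recent `window` episodes

    def record(agent_scores):
        # Assumption: the episode score is the best score among the agents.
        scores.append(max(agent_scores))
        average = sum(scores) / len(scores)
        # Solved only once a full window of episodes has been collected.
        solved = len(scores) == window and average >= target
        return average, solved

    return record
```

Calling `record([score_agent_0, score_agent_1])` after every episode returns the current rolling average and whether the target has been reached.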

Given below are the characteristics of the agents and the environment.

* Unity brain name: TennisBrain
* Number of Visual Observations (per agent): 0
* Vector Observation space type: continuous
* Vector Observation space size (per agent): 8
* Number of stacked Vector Observation: 3
* Vector Action space type: continuous
* Vector Action space size (per agent): 2



## Learning Algorithm

To solve this reinforcement learning problem, I use Deep Deterministic Policy Gradient (DDPG) with modifications that make it suitable for a multi-agent environment.

### Deep Deterministic Policy Gradient (DDPG)

**DDPG** is an actor-critic algorithm that extends **DQN** to continuous action spaces. It uses two deep neural networks, one as the actor and the other as the critic, with similar architectures. Both are trained with the **Adam** optimizer using a **learning rate of 0.0001**, and the **discount factor** is **0.99**.

```python
GAMMA = 0.99            # discount factor
LR_ACTOR = 1e-4         # learning rate of the actor 
LR_CRITIC = 1e-4        # learning rate of the critic
```
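
The discount factor enters through the target used to train the critic. The sketch below shows the standard one-step DDPG target, `y = r + GAMMA * (1 - done) * Q'(s', mu'(s'))`. Here `next_q` stands in for the value the target critic assigns to the next state and the target actor's action. The function name `td_target` is hypothetical.

```python
GAMMA = 0.99  # discount factor, as in the settings above


def td_target(reward, next_q, done, gamma=GAMMA):
    """One-step TD target; the bootstrap term is dropped on terminal steps."""
    return reward + gamma * next_q * (1.0 - float(done))
```
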
#### Neural Network Architecture

State --> BatchNorm --> 400 --> ReLU --> 300 --> ReLU --> action size --> tanh --> action

#### PyTorch Implementation
```python
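    # Layers (requires: import torch, torch.nn as nn, torch.nn.functional as F)
    self.bn1 = nn.BatchNorm1d(state_size)    # normalizes the raw state input
    self.fc1 = nn.Linear(state_size, 400)
    self.fc2 = nn.Linear(400, 300)
    self.fc3 = nn.Linear(300, action_size)   # output layer, one unit per action

    ...

    # Forward pass
    state = self.bn1(state)
    x = F.relu(self.fc1(state))
    x = F.relu(self.fc2(x))
    x = torch.tanh(self.fc3(x))              # squashes each action into [-1, 1]
```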

#### Experience Replay

We store the last one million experience tuples (S, A, R, S') in a data container called the **Replay Buffer**, from which we sample **mini-batches of 128** experiences. Sampling at random breaks the correlation between consecutive experiences, which makes training more stable.

```python
    BUFFER_SIZE = int(1e6)  # replay buffer size
    BATCH_SIZE = 128       # minibatch size
```
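
A replay buffer of this kind can be sketched with a bounded `deque`. This is a minimal illustration, not the project's implementation: the class layout, the `seed` parameter, and the uniform sampling strategy are assumptions.

```python
import random
from collections import deque, namedtuple

# One stored transition (S, A, R, S')
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])


class ReplayBuffer:
    """Fixed-size buffer that drops the oldest experience when full."""

    def __init__(self, buffer_size=int(1e6), batch_size=128, seed=0):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)   # private RNG for reproducible sampling

    def add(self, state, action, reward, next_state):
        self.memory.append(Experience(state, action, reward, next_state))

    def sample(self):
        # Uniform sampling without replacement from the stored experiences.
        return self.rng.sample(self.memory, k=self.batch_size)

    def __len__(self):
        return len(self.memory)
```

Learning would normally start only once `len(buffer)` is at least the batch size, since `sample` cannot draw more experiences than are stored.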

#### Soft Target Updates

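To compute the target values for both the actor and critic networks, we use a **Soft Target Update** strategy: the target networks are separate copies whose weights are moved a small fraction **TAU** towards the learned weights at each update, rather than being copied all at once.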

```python
    TAU = 1e-3              # for soft update of target parameters
```
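
The update rule is `target = TAU * local + (1 - TAU) * target`, applied to every weight. The sketch below illustrates the rule on plain lists of floats so that it runs without any framework. The function name `soft_update` is an assumption, and in a PyTorch implementation the same blend would be applied to each pair of parameter tensors.

```python
TAU = 1e-3  # interpolation factor, as in the setting above


def soft_update(local_params, target_params, tau=TAU):
    """Blend local weights into target weights: tau*local + (1 - tau)*target."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]
```

With a small `TAU`, the target weights trail the learned weights slowly, which keeps the training targets from shifting abruptly.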

## Plot of Rewards

After tuning the hyperparameters, the agents solved the environment in **1191 episodes**. The plot below shows the reward per episode and the target score.


![Plot of Rewards](plot.png)

Trained models can be found [here](checkpoint_actor.pth) and [here](checkpoint_critic.pth). 

## Ideas for Future Work

* Multi-Agent DDPG would be well suited to environments like [Soccer](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Learning-Environment-Examples.md#soccer-twos). Implementing it is one idea for improvement.