# Project 3: Collaboration and Competition


## Project details:

![title](tennis.png)
Unity ML-Agents Tennis Environment


In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1. If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01. Thus, the goal of each agent is to keep the ball in play.

The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. Each agent receives its own, local observation. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping.

The task is episodic, and in order to solve the environment, your agents must get an average score of +0.5 (over 100 consecutive episodes, after taking the maximum over both agents). Specifically,

1. After each episode, we add up the rewards that each agent received (without discounting) to get a score for each agent. This yields 2 (potentially different) scores. We then take the maximum of these 2 scores.
2. This yields a single score for each episode.

The environment is considered solved when the average (over 100 episodes) of those scores is at least +0.5.
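
The scoring rule above can be sketched in a few lines of plain Python. This is only an illustration: `episode_score` and `SolvedTracker` are hypothetical names that do not come from the project code, and the per-step rewards are assumed to be available as one list per agent.

```python
from collections import deque


def episode_score(rewards_per_agent):
    """Sum each agent's undiscounted rewards, then take the max over agents."""
    return max(sum(rewards) for rewards in rewards_per_agent)


class SolvedTracker:
    """Tracks the running average over the last `window` episode scores."""

    def __init__(self, window=100, target=0.5):
        self.scores = deque(maxlen=window)  # old scores fall out automatically
        self.target = target

    def add(self, score):
        """Record one episode score and report whether the task is solved."""
        self.scores.append(score)
        return self.is_solved()

    def average(self):
        return sum(self.scores) / len(self.scores) if self.scores else 0.0

    def is_solved(self):
        # Require a full window so early lucky episodes do not count.
        full = len(self.scores) == self.scores.maxlen
        return full and self.average() >= self.target


# Example: agent 0 scores roughly 0.19, agent 1 scores 0.1, so the
# episode score is agent 0's total.
score = episode_score([[0.1, 0.0, 0.1, -0.01], [0.0, 0.1, 0.0]])
```

Requiring a full 100-episode window before declaring success is a design choice of this sketch; it matches the "100 consecutive episodes" wording of the task.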


## Learning Algorithm:

The agent is based on the Multi-Agent DDPG algorithm. A 'Critic' neural network approximates the action-value function and is trained on data collected from both players. The two players use the same 'Actor' neural network to generate their own control policies, each based on its local observation.

### The network:
#### Actor
Inputs: system states (24 values per agent: the 8 observation variables stacked over 3 consecutive frames)

Outputs: value for each action (total of 2 actions)

Hidden Layer 1: fully connected, 96 neurons, ReLU activation function

Hidden Layer 2: fully connected, 96 neurons, ReLU activation function
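
As a rough illustration of the actor's shape (24 inputs, two hidden layers of 96 ReLU units, 2 outputs), here is a dependency-free forward pass. It is a sketch rather than the project's model: the uniform weight initialisation, the zero biases, and the `tanh` output squashing (to keep actions in [-1, 1]) are assumptions, and every function name is hypothetical.

```python
import math
import random

STATE_SIZE, HIDDEN_SIZE, ACTION_SIZE = 24, 96, 2


def init_layer(n_in, n_out, rng):
    """Create one fully connected layer as (weights, biases)."""
    limit = 1.0 / math.sqrt(n_in)  # assumed initialisation range
    weights = [[rng.uniform(-limit, limit) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(layer, x):
    """Compute W x + b for one layer."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def make_actor(seed=0):
    """Build the three layers: 24 -> 96 -> 96 -> 2."""
    rng = random.Random(seed)
    sizes = [STATE_SIZE, HIDDEN_SIZE, HIDDEN_SIZE, ACTION_SIZE]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def actor_forward(actor, state):
    """Map one 24-value observation to 2 continuous actions."""
    h = relu(linear(actor[0], state))
    h = relu(linear(actor[1], h))
    # Assumption: tanh squashes each action into [-1, 1].
    return [math.tanh(v) for v in linear(actor[2], h)]


def count_parameters(layers):
    """Total number of weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


actor = make_actor()
action = actor_forward(actor, [0.0] * STATE_SIZE)
```

With these layer sizes the actor has 2,400 + 9,312 + 194 = 11,906 trainable parameters, which is small enough to train quickly on a CPU.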

#### Critic
Inputs: system states (total of 24 states)

Outputs: the estimated value (a single scalar)

Hidden Layer 1: fully connected, 96 neurons, ReLU activation function

Hidden Layer 2: fully connected, 96 neurons, ReLU activation function


### The DDPG algorithm
The Multi-Agent DDPG agent explores by adding noise to its actions, with the noise scaled by a factor epsilon. Epsilon is set to 1 at the beginning of training to allow more exploration. It then decays at a rate of 1e-6 to shift the balance toward exploitation as the agent improves.
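
A minimal sketch of epsilon-scaled exploration follows. Several details are assumptions, because the text above does not fix them: Gaussian noise stands in for whatever noise process the project uses (DDPG implementations often use an Ornstein-Uhlenbeck process), epsilon is decayed linearly once per action, and actions are clipped to [-1, 1]. `NoisyPolicy` and `sigma` are hypothetical names.

```python
import random

EPSILON = 1.0         # initial noise scale
EPSILON_DECAY = 1e-6  # amount subtracted from epsilon (assumed: per action)


class NoisyPolicy:
    """Adds epsilon-scaled noise to a deterministic action."""

    def __init__(self, epsilon=EPSILON, decay=EPSILON_DECAY, sigma=0.2, seed=0):
        self.epsilon = epsilon
        self.decay = decay
        self.sigma = sigma  # assumed standard deviation of the noise
        self.rng = random.Random(seed)

    def act(self, greedy_action):
        """Return the actor's action plus noise, clipped to [-1, 1]."""
        noisy = [a + self.epsilon * self.rng.gauss(0.0, self.sigma)
                 for a in greedy_action]
        # Linear decay, floored at zero (an assumption of this sketch).
        self.epsilon = max(0.0, self.epsilon - self.decay)
        return [max(-1.0, min(1.0, a)) for a in noisy]


policy = NoisyPolicy()
action = policy.act([0.5, -0.5])  # `[0.5, -0.5]` stands in for the actor output
```

With a decrement of 1e-6 per step, epsilon would take one million steps to reach zero under this linear schedule, so exploration noise stays significant for most of a run of a few thousand episodes.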

Experience replay is used to reduce the correlation between consecutive samples and improve training performance. The experience buffer is large (1e6 entries) so that it can store enough experiences. The networks are updated with minibatches of 256 samples, drawn at random from the experience buffer.
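
The replay mechanism described above can be sketched with a bounded `deque` and uniform random sampling. The class and field names here are illustrative and are not taken from the project code.

```python
import random
from collections import deque, namedtuple

# One stored transition; the field names are illustrative.
Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size buffer that returns uniformly sampled minibatches."""

    def __init__(self, buffer_size=int(1e6), batch_size=256, seed=0):
        # A deque with maxlen silently drops the oldest entry when full.
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Draw a minibatch without replacement; breaks temporal ordering."""
        return self.rng.sample(self.memory, self.batch_size)

    def __len__(self):
        return len(self.memory)


# Small sizes for demonstration only.
buffer = ReplayBuffer(buffer_size=5, batch_size=3)
for i in range(8):
    buffer.add([i], [0.0, 0.0], 0.1, [i + 1], False)
batch = buffer.sample()
```

Because sampling is uniform over the whole buffer, a minibatch mixes transitions from many different episodes, which is what weakens the correlation between consecutive training samples.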

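Soft updates are applied to the target networks: after each learning step, the target parameters are moved a small fraction (TAU = 1e-3) of the way toward the parameters of the networks being trained, which keeps the learning targets stable.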
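
The standard soft-update rule is `target = tau * local + (1 - tau) * target`, applied to every parameter. The sketch below applies it to flat lists of floats for clarity; a real implementation would iterate over the network's parameter tensors instead. `soft_update` is a hypothetical helper name.

```python
TAU = 1e-3  # interpolation factor for the soft update


def soft_update(local_params, target_params, tau=TAU):
    """Blend local parameters into target parameters.

    Returns the new target parameters:
        target <- tau * local + (1 - tau) * target
    """
    return [tau * l + (1.0 - tau) * t
            for l, t in zip(local_params, target_params)]


# The target moves 0.1% of the way toward the local parameters per update.
new_target = soft_update([1.0, 2.0], [0.0, 0.0])
```

Setting `tau=1.0` would reduce this to a hard copy of the local network, while a small value such as 1e-3 makes the target change slowly.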

### Summary of Hyperparameters

BUFFER_SIZE = int(1e6)  # replay buffer size

BUFFER_FILL = int(1e4) # How much of the buffer should be filled before learning

NUM_UPDATES_CACHE = 2 # How many times to update from cache buffer

BATCH_SIZE = 256        # minibatch size

GAMMA = 0.99            # discount factor

TAU = 1e-3              # for soft update of target parameters

LR_ACTOR = 1e-3         # learning rate of the actor

LR_CRITIC = 1e-3        # learning rate of the critic

WEIGHT_DECAY = 0        # L2 weight decay

UPDATE_EVERY = 20       # timesteps between updates

NUM_UPDATES = 15        # num of update passes when updating

EPSILON = 1.0           # epsilon for the noise process added to the actions

EPSILON_DECAY = 1e-6    # decay for epsilon above
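
One plausible reading of how `BUFFER_FILL`, `UPDATE_EVERY`, and `NUM_UPDATES` interact is sketched below: no learning until the buffer holds `BUFFER_FILL` transitions, then `NUM_UPDATES` gradient passes every `UPDATE_EVERY` timesteps. This is an assumption based only on the comments above, `updates_due` is a hypothetical helper, and `NUM_UPDATES_CACHE` is left out because its role is not described here.

```python
BUFFER_FILL = int(1e4)  # transitions required before learning starts
UPDATE_EVERY = 20       # timesteps between learning phases
NUM_UPDATES = 15        # gradient passes per learning phase


def updates_due(step, buffer_len,
                buffer_fill=BUFFER_FILL,
                update_every=UPDATE_EVERY,
                num_updates=NUM_UPDATES):
    """Return how many gradient passes to run at this timestep."""
    if buffer_len < buffer_fill:
        return 0  # still collecting initial experience
    if step % update_every != 0:
        return 0  # between learning phases
    return num_updates


# Over 100 timesteps with a sufficiently full buffer there are 5 learning
# phases of 15 passes each, i.e. 75 gradient updates in total.
total = sum(updates_due(step, buffer_len=BUFFER_FILL) for step in range(1, 101))
```

Under this reading the agent performs 0.75 gradient updates per environment step on average, which is somewhat less than updating once every step.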


### Plot of rewards:
The following figure shows how the score (averaged over 100 episodes) evolves during training. The average score rose above 0.5 at episode 1956:

![title](reward.png)


## Future improvement:
I am very satisfied with the current Multi-Agent DDPG agent's performance. Further work will focus on:

1. Visualizing the trained agents in the Unity environment

