# Deep Reinforcement Learning Nanodegree: Project 3 - Collaboration and Competition - Report


### 1. General:

The goal of this project was to train two agents to control rackets and bounce a ball over a net. To maximize the reward, the agents need to learn how to hit the ball over the net and how to avoid letting the ball hit the ground or fly out of bounds.

[//]: # (Image References)

<br>
Random Agent:

[image1]: https://raw.githubusercontent.com/cpow-89/Deep_Reinforcement_Learning_Nanodegree_Project_3_Collaboration_and_Competition/master/images/untrained_agent.gif?token=AmwnwtKri9y4IVzUu-oIN1yMjon4U0fIks5b4fCYwA%3D%3D "Random Agent"

![Random Agent][image1]

### 2. Learning algorithm

General Information:

The learning algorithm is a simplified version of the deep genetic algorithm introduced in the paper "Deep Neuroevolution: Genetic Algorithms are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning" by Uber AI Labs. My version does not implement the multiprocessing component of the paper. I also did not implement Novelty Search because the basic algorithm was already able to solve the task.

Algorithm:

1. Initialization:
- create an initial population $P$ of $N$ individuals
    - in our case, an individual is defined as the weights $\theta$ of a neural network
    - every individual is evaluated based on a fitness function
    - each individual is stored as a tuple of the form ($\theta$, fitness_score)
    
2. Evolution:
- sort current population by fitness
- evolve the population $P$ of $N$ individuals
- generate the next generation
    - the best individual from the current generation is copied unchanged to the next generation
        - this technique is called elitism
        - to select the true elite more reliably, each of the top $T$ individuals per generation is evaluated on $x$ additional episodes and its mean reward is calculated
        - the individual with the highest mean reward is then selected as elite
        - $T$ is defined in config as "parent_count"
        - $x$ is defined in config as "elite_evaluation_count"
    - select the parents for the next generation via truncation selection
        - the top $T$ individuals become the parents of the next generation
        - a parent $\theta_i$ is selected uniformly at random from these $T$ individuals
        - the selected parent is mutated by applying additive Gaussian noise $\delta$ to its parameters
            - $\theta' = \theta_i + \delta$, where each component of $\delta$ is drawn from a zero-mean Gaussian with standard deviation "noise_std"
        - this is repeated $N - 1$ times to fill the remaining slots of the next generation
- repeat the process until the task is solved
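
The steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: individuals are flat lists of floats instead of network weights, and `toy_fitness`, `mutate`, and `evolve` are hypothetical names. In the real project the fitness would be the reward collected in an episode of the Tennis environment.

```python
import random

def toy_fitness(theta, rng):
    # Stand-in for an environment rollout (assumption): higher is better,
    # with the maximum reached when every parameter equals 1.0.
    return -sum((w - 1.0) ** 2 for w in theta)

def mutate(theta, noise_std, rng):
    # theta' = theta + delta, with delta drawn from a zero-mean Gaussian.
    return [w + rng.gauss(0.0, noise_std) for w in theta]

def evolve(fitness, n_params, population_size=50, parent_count=10,
           elite_evaluation_count=5, noise_std=0.01, generations=20, seed=0):
    rng = random.Random(seed)

    def mean_fitness(theta):
        # Average over several extra evaluations to reduce evaluation noise.
        total = sum(fitness(theta, rng) for _ in range(elite_evaluation_count))
        return total / elite_evaluation_count

    # 1. Initialization: N individuals stored as (theta, fitness_score).
    population = []
    for _ in range(population_size):
        theta = [rng.gauss(0.0, 1.0) for _ in range(n_params)]
        population.append((theta, fitness(theta, rng)))

    for _ in range(generations):
        # 2. Evolution: truncation selection keeps the top T as parents.
        population.sort(key=lambda ind: ind[1], reverse=True)
        parents = population[:parent_count]

        # Elitism: re-evaluate the parents and copy the best one unchanged.
        rescored = [(theta, mean_fitness(theta)) for theta, _ in parents]
        next_population = [max(rescored, key=lambda ind: ind[1])]

        # Fill the remaining N - 1 slots with mutated copies of random parents.
        while len(next_population) < population_size:
            parent_theta, _ = rng.choice(parents)
            child = mutate(parent_theta, noise_std, rng)
            next_population.append((child, fitness(child, rng)))
        population = next_population

    return max(population, key=lambda ind: ind[1])

best_theta, best_score = evolve(toy_fitness, n_params=3, noise_std=0.1)
print(best_score)
```

Because the elite is carried over unchanged, the best score in this sketch cannot get worse from one generation to the next as long as the fitness function is deterministic.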


### 3. Hyperparameters
- hyperparameters can be found in the config file

    
observation_size: 48
- the observation size is 48 because the policy network receives the observations of both agents, with 24 input signals each

action_size: 4
- the action size is 4 because the policy network controls both agents, with 2 action signals each
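
A small sketch of how one network can serve both agents, assuming the two observations are simply concatenated and the four outputs are split in the same agent order. The function names and the ordering are illustrative assumptions, not taken from the project code.

```python
def join_observations(obs_a, obs_b):
    """Concatenate two 24-value observations into one 48-value network input."""
    assert len(obs_a) == 24 and len(obs_b) == 24
    return list(obs_a) + list(obs_b)

def split_actions(joint_action):
    """Split a 4-value network output into one 2-value action per agent."""
    assert len(joint_action) == 4
    return list(joint_action[:2]), list(joint_action[2:])

network_input = join_observations([0.0] * 24, [1.0] * 24)
print(len(network_input))
print(split_actions([0.1, -0.2, 0.3, 0.4]))
```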

fc_units: [ 64, 64, 64 ]
- the policy network is built from 3 feed-forward layers
- the list determines the number of hidden nodes for each layer
- the number of nodes was chosen experimentally

activation_funcs: [ "Tanh", "Tanh", "Tanh" ]
- the policy network is built from 3 feed-forward layers
- the list determines the type of activation function applied to the output of each layer
- Tanh was used because the agents need to produce continuous actions in the range -1.0 to 1.0

noise_std: 0.01
- the standard deviation of the Gaussian noise used to mutate the policy network parameters of the population

population_size: 50
- number $N$ of policy networks that make up the population $P$ the deep genetic algorithm operates on
- the value was chosen to be large enough to get reasonably high variance in the population without being too heavy computationally

parent_count: 10
- number $T$ of top performing individuals that become the parents of the next generation

elite_evaluation_count: 5
- number of evaluations made for every top $T$ individual during elitism
- in the original paper, this number is set to 30, but this was far too time-consuming for this non-multiprocessing version
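
To give a rough sense of why the elite evaluation count matters, the following sketch estimates the number of episodes per generation. It assumes one episode per regular fitness evaluation and that all $N$ individuals are evaluated each generation; neither detail is stated in this report, and `episodes_per_generation` is a hypothetical helper.

```python
def episodes_per_generation(population_size, parent_count, elite_evaluation_count):
    # One episode per individual (assumption) plus the extra episodes
    # spent on re-evaluating each elite candidate.
    return population_size + parent_count * elite_evaluation_count

print(episodes_per_generation(50, 10, 5))   # 100
print(episodes_per_generation(50, 10, 30))  # 350
```

Under these assumptions, the paper's setting of 30 would make each generation about 3.5 times as expensive as the setting used here.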

### 4. Network architectures

ContinuousPolicyNetwork(<br>
&nbsp;&nbsp;&nbsp;&nbsp;(network): Sequential(<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(0): Linear(in_features=48, out_features=64, bias=True)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(1): Tanh()<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(2): Linear(in_features=64, out_features=64, bias=True)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(3): Tanh()<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(4): Linear(in_features=64, out_features=4, bias=True)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(5): Tanh()<br>
&nbsp;&nbsp;&nbsp;&nbsp;)<br>
)<br>
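
Since an individual is the full set of network weights $\theta$, the size of $\theta$ follows directly from the layer shapes printed above. The helper below is an illustrative sketch that counts weights and biases of a fully connected network; `count_parameters` is a hypothetical name.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    pairs = zip(layer_sizes, layer_sizes[1:])
    return sum(n_in * n_out + n_out for n_in, n_out in pairs)

# 48 inputs, two hidden layers of 64 units, 4 outputs
print(count_parameters([48, 64, 64, 4]))  # 7556
```

Each mutation therefore perturbs 7556 values at once.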

### 5. Results

Environment solved in 106 generations.    Average reward (over 100 episodes): 0.7
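
An average over 100 episodes can be tracked with a fixed-size window, as in the sketch below. The threshold of 0.5 is an assumption taken from the Udacity task description and is not stated in this report; `first_solved_index` is a hypothetical helper, not part of the project code.

```python
from collections import deque

def first_solved_index(scores, window=100, threshold=0.5):
    """Return the index of the first episode at which the mean of the
    last `window` scores reaches `threshold`, or None if it never does."""
    recent = deque(maxlen=window)
    for i, score in enumerate(scores):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= threshold:
            return i
    return None

print(first_solved_index([0.0] * 100 + [1.0] * 100))
```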

To get a better view of the training statistics, open the TensorBoard log with the following console command:

`tensorboard --logdir=monitor/Tennis_Linux/2018_10_29__21_37_16`

[image2]: https://raw.githubusercontent.com/cpow-89/Deep_Reinforcement_Learning_Nanodegree_Project_3_Collaboration_and_Competition/master/images/reward%20plot.png?token=AmwnwqPxOkm7MIWmWfE2gtiHLjrcoh1eks5b4edqwA%3D%3D "Reward Plot"
![Reward Plot][image2]

- I stopped training before the reward reached its maximum because of the long computation time



Trained Agent:

[image3]: https://raw.githubusercontent.com/cpow-89/Deep_Reinforcement_Learning_Nanodegree_Project_3_Collaboration_and_Competition/master/images/trained_agent.gif?token=Amwnwiow3R4NuNeRDGtNGVjxiVDyXnWEks5b4e96wA%3D%3D "Trained Agent"
![Trained Agent][image3]



### 6. Ideas for Future Work
- add Novelty Search to increase the variance in the population and avoid local optima
- add multiprocessing -> currently not possible with the Unity environment provided by Udacity
    - at least the Linux version of the environment was not multiprocessing-ready
    - this seems to be a known issue: https://github.com/Unity-Technologies/ml-agents/issues/956
- try different selection functions like "tournament"
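
As an illustration of the last idea, tournament selection could replace the uniform choice among the top $T$ parents. The sketch below assumes the same (theta, fitness_score) tuples as in the algorithm description; `tournament_select` and `tournament_size` are hypothetical names, and this variant has not been tried in the project.

```python
import random

def tournament_select(population, tournament_size, rng):
    """Pick `tournament_size` individuals at random and return the fittest one."""
    contestants = rng.sample(population, tournament_size)
    return max(contestants, key=lambda ind: ind[1])

rng = random.Random(0)
population = [([float(i)], float(i)) for i in range(10)]
parent_theta, parent_score = tournament_select(population, 3, rng)
print(parent_score)
```

A larger tournament size increases selection pressure, while a size of 1 reduces to a uniform random choice over the whole population.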