# Report
## Algorithm
The algorithm is a standard DQN with a separate target network and replay memory, following the original [DQN paper](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf). The agent learns from batches of experiences sampled from the replay memory. The official [PyTorch tutorial](https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html) served as the starting point and was adapted to this environment.
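
To make the role of the target network concrete, the following is a minimal sketch of the one-step target a DQN regresses towards. It uses plain Python lists instead of tensors, and the function name `td_targets` and the sample numbers are illustrative assumptions, not taken from the project code.

```python
def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """Compute one-step TD targets for a batch of transitions.

    rewards:       list of floats, one per transition
    next_q_values: list of lists, target-network Q-values for each next state
    dones:         list of bools, True if the episode ended on that transition
    """
    targets = []
    for reward, q_next, done in zip(rewards, next_q_values, dones):
        # Terminal transitions have no future value to bootstrap from.
        bootstrap = 0.0 if done else max(q_next)
        targets.append(reward + gamma * bootstrap)
    return targets


# Hypothetical batch of three transitions with four actions each.
batch_targets = td_targets(
    rewards=[1.0, 0.0, -1.0],
    next_q_values=[[0.5, 0.2, 0.1, 0.0], [1.0, 2.0, 0.0, 0.5], [3.0, 3.0, 3.0, 3.0]],
    dones=[False, False, True],
    gamma=0.9,
)
```

The policy network is then trained to move its Q-value for the action actually taken towards these targets, while the Q-values used for bootstrapping come from the slower-moving target network.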
## Neural Network
The neural network has an input layer of size 37 for the 37 observations the environment returns, an output layer of size 4 for the 4 available actions, and one or more hidden layers of configurable size.
### Hidden layers
The task should not require more than one or two hidden layers. Networks with one and with two hidden layers both achieved an average reward of +10 over 100 consecutive episodes. However, the two-layer network only reached this performance after 1000 episodes, while the one-layer network took only 200. In addition, the two-layer network had a disk size of 1MB, while the one-layer network requires only 20KB. The single-layer network is therefore preferable.
### Number of neurons
The default number of neurons chosen for the hidden layer was 128. A lower number of neurons seemed to have a negative effect on the agent's performance.
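
As a rough sanity check on model size, the sketch below counts the parameters of a fully connected network with 37 inputs, one hidden layer of 128 neurons and 4 outputs. It assumes every layer has a bias term and that parameters are stored as 32-bit floats; the helper name `count_parameters` is hypothetical.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, input and output included.
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix plus bias vector
    return total


BYTES_PER_FLOAT32 = 4  # assumption: parameters are stored as 32-bit floats

single_hidden = count_parameters([37, 128, 4])
size_kb = single_hidden * BYTES_PER_FLOAT32 / 1000
print(single_hidden, size_kb)  # 5380 parameters, roughly 21.5 KB
```

Under these assumptions the single-hidden-layer network comes out at about 21.5 KB of raw parameters, which is in the same range as the file size reported for it. A saved checkpoint also contains some metadata, so the numbers will not match exactly.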
## Agent
### batch size
The batch size describes how many experiences the agent pulls out of memory to learn from at a time. Organizing the learning in batches ensures that the agent learns from a set of experiences that represents the environment as a whole, instead of from a single experience that might be highly unlikely in general. A batch size of 132 was chosen for most attempts.
### gamma
Gamma is the discount rate and indicates how strongly the agent cares about the future. As this is a task with dense rewards, a smaller gamma should suffice, because the agent can usually reach a reward within a few actions. Both 0.9 and 0.99 were tested.
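
To illustrate the difference between the two tested values, the sketch below computes the discounted value of a single reward that arrives after a short delay, along with the common rule-of-thumb horizon `1 / (1 - gamma)`. The reward sequence is a made-up example, not data from the training runs.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each discounted by gamma per step of delay."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def effective_horizon(gamma):
    """Rough number of steps the agent looks ahead: 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)


# Hypothetical episode fragment: a yellow banana (+1) is reached after 10 empty steps.
rewards = [0.0] * 10 + [1.0]
for gamma in (0.9, 0.99):
    print(gamma, round(discounted_return(rewards, gamma), 3), round(effective_horizon(gamma)))
```

With gamma 0.9 the banana is worth about 0.35 at the start of the fragment and the horizon is roughly 10 steps. With gamma 0.99 it is worth about 0.90 and the horizon is roughly 100 steps.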
### epsilon
Without random exploration the agent would become too scared of blue bananas and turn in circles instead of moving. Random exploration is needed to encourage it to find the reward that the yellow bananas give. A small but constant value seems to be well suited to this environment.
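
Epsilon-greedy action selection with a constant epsilon can be sketched as follows. The function name `select_action`, the example Q-values and the epsilon of 0.05 are assumptions for illustration; the report does not state the exact value used.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform over all actions
    # exploit: index of the largest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical Q-values for the four available actions.
q_values = [0.1, 0.9, 0.3, 0.2]
action = select_action(q_values, epsilon=0.05)
```

With epsilon set to 0 the function always returns the greedy action, so any remaining movement comes from the learned policy alone.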
### epsilon decay
Epsilon decay is useful in many environments where the agent needs to explore the environment initially to find an optimal policy. In this case, however, the environment states are not highly dependent on each other, so initial exploration is not needed. Testing showed that epsilon decay was counterproductive in the beginning and that the agent only really started to learn once epsilon had decayed to its final value. See [model 1](#model1).

At the beginning of this run the agent did not manage to score any points, but once epsilon had decayed it was able to learn quickly. Because of these insights, epsilon decay was removed from future runs.
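
For reference, an exponential epsilon schedule of the kind used in the PyTorch tutorial can be sketched as below. The start and end values are assumptions, since the report only mentions the decay constant, and `epsilon_at` is a hypothetical helper name.

```python
import math


def epsilon_at(step, eps_start=0.9, eps_end=0.05, eps_decay=1000):
    """Exponentially decay epsilon from eps_start towards eps_end.

    eps_decay controls the speed: larger values mean slower decay.
    eps_start and eps_end are assumed values, not taken from the report.
    """
    return eps_end + (eps_start - eps_end) * math.exp(-step / eps_decay)


# Epsilon at a few points of training for a decay constant of 1000.
schedule = [epsilon_at(step) for step in (0, 1000, 5000)]
```

After `eps_decay` steps the distance to the final value has shrunk to about 37% of its initial size, so a larger decay constant keeps the agent acting mostly at random for longer.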
### tau
Tau describes the rate at which the target network converges towards the policy network. A value of 0.005 was chosen.
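
Assuming the target network is updated with the usual soft update, each target weight is moved a fraction tau towards the corresponding policy weight. The sketch below shows this on plain lists of floats; `soft_update` and the example weights are illustrative, not the project's code.

```python
def soft_update(target_params, policy_params, tau=0.005):
    """Blend policy weights into target weights: tau * policy + (1 - tau) * target."""
    return [
        tau * policy + (1.0 - tau) * target
        for target, policy in zip(target_params, policy_params)
    ]


# Hypothetical weights: the target starts at 0.0 and the policy sits at 1.0.
target = [0.0]
policy = [1.0]
for _ in range(3):
    target = soft_update(target, policy, tau=0.005)
```

With tau at 0.005 the gap between the two networks shrinks by a factor of 0.995 per update, so it takes roughly 138 updates to halve it.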
### learning rate
The learning rate describes how much the agent adjusts its policy to fit new experiences. Setting it too high results in the agent locking in on one strategy too quickly without considering better policies, or even becoming too scared of blue bananas in its effort to avoid punishment. Setting the learning rate too low results in a longer training period. Learning rates of both 0.001 and 0.0001 were tested.
### memory size
Setting the memory size too low resulted in the agent eventually forgetting the most basic things it had learned in the beginning, such as avoiding blue bananas. The initial size of 10000 led to decreased performance in later episodes. A memory size of 1000000 seemed more suitable. See [model 2](#model2).
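
The forgetting effect follows directly from how a fixed-size replay memory works: once it is full, every new experience pushes out the oldest one. The following is a minimal sketch in the style of the PyTorch tutorial; the field names of `Transition` and the tiny capacity are assumptions chosen to keep the example readable.

```python
import random
from collections import deque, namedtuple

# Field names are illustrative; the project may store transitions differently.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])


class ReplayMemory:
    """Fixed-size buffer that drops the oldest transition once it is full."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, *args):
        self.buffer.append(Transition(*args))

    def sample(self, batch_size):
        # Uniform sampling without replacement.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)


# Tiny capacity to make the eviction visible: transitions 0 and 1 are forgotten.
memory = ReplayMemory(capacity=3)
for i in range(5):
    memory.push(i, 0, 0.0, i + 1, False)
batch = memory.sample(2)
```

With a capacity of 10000 the early experiences are overwritten fairly soon after the start of training, whereas a capacity of 1000000 keeps them available for sampling much longer.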


## Conclusion
### Model 1
<a id="model1"></a> 

Learning rate 0.0001 and epsilon decay 1000

<img src="images/eps_decay.JPG">

### Model 2
<a id="model2"></a>

Learning rate 0.001 and epsilon decay 10000

Tried:
- Higher epsilon end of 0.2

Retrained 5 times with different settings from a snapshot taken at 1500 episodes

<img src="images/memory_size.JPG">

### Parameters
### Training
### Future work
There are many potential improvements to the standard DQN algorithm that could be applied to this problem. Specifically, prioritized experience replay could be beneficial in this scenario, as the agent experiences many interactions that are of little value for learning, such as walking from A to B without collecting a banana or walking against a wall. As the [Rainbow](https://arxiv.org/abs/1710.02298) paper shows, prioritized experience replay together with multi-step bootstrap targets provides the best improvements to the standard DQN algorithm.
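
The core idea of proportional prioritized replay can be sketched in a few lines: transitions are sampled with a probability that grows with their priority, typically the magnitude of their TD error. This is a simplified illustration only. It leaves out the importance-sampling correction and the efficient tree structure used in practice, and the `alpha` value and example priorities are assumptions.

```python
import random


def sampling_probabilities(priorities, alpha=0.6):
    """Turn priorities into sampling probabilities: p_i ** alpha / sum_k p_k ** alpha."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_indices(priorities, batch_size, alpha=0.6, rng=random):
    """Sample transition indices (with replacement) in proportion to priority."""
    probs = sampling_probabilities(priorities, alpha)
    return rng.choices(range(len(priorities)), weights=probs, k=batch_size)


# Hypothetical priorities: index 2 is a surprising transition (e.g. a banana pickup),
# the others are uneventful steps with small TD errors.
priorities = [0.01, 0.01, 1.0, 0.01]
probs = sampling_probabilities(priorities)
indices = sample_indices(priorities, batch_size=8)
```

Setting `alpha` to 0 recovers uniform sampling, so the parameter controls how strongly the buffer favours surprising transitions over uneventful ones.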
