# Deep Reinforcement Learning Nanodegree: Project 2 - Continuous Control - Report


### 1. General:

The goal of this project was to train an agent, represented by a double-jointed arm, to maintain its position at the target location (the large green sphere) for as many time steps as possible.

[//]: # (Image References)

<br>
Random Agent:

[image1]: https://raw.githubusercontent.com/cpow-89/Deep_Reinforcement_Learning_Nanodegree_Project_2_Continuous_Control/master/images/untrained_agent.gif?token=AmwnwlXyXniU-umlY4BNx8VSfAnYd57mks5bxNYIwA%3D%3D "Random Agent"

![Random Agent][image1]

### 2. Learning algorithm

General Information:

The learning algorithm used is Deep Deterministic Policy Gradient (DDPG), introduced in the paper "Continuous control with deep reinforcement learning" by Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra.
The authors adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain.
The algorithm is often classified as an "Actor-Critic" method, but it can also be seen as a DQN method for continuous action spaces.
The reason is that the critic network in DDPG is used to approximate the maximum over the Q-values
of the next state (evaluated at the actor's action) and not as a learned baseline as in other "Actor-Critic" methods.

Intuition:

- we use two deep neural networks (one representing the actor and one representing the critic)
    - we also use a copy of each network as a target network to get a more stable learning phase
- the actor is used to approximate the optimal policy deterministically
    - the actor always outputs the best-believed action for a given state
    - the actor is basically learning $\arg\max_a Q(s,a)$
- the critic learns to evaluate the optimal action-value function by using the actor's best-believed action
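
The idea that the actor learns $\arg\max_a Q(s,a)$ can be illustrated with a toy problem. The following is a minimal sketch, not the project's code: the critic `q_value` is a hypothetical, already known function whose best action is `2 * state`, and the actor is a single weight. The actor is improved by following the gradient of Q with respect to the action, which is the core of the deterministic policy gradient.

```python
import random


def q_value(state, action):
    # Hypothetical, fixed critic: the best action for a state is 2 * state.
    return -(action - 2.0 * state) ** 2


def dq_daction(state, action):
    # Analytic gradient of q_value with respect to the action.
    return -2.0 * (action - 2.0 * state)


def train_linear_actor(steps=2000, learning_rate=0.05, seed=0):
    rng = random.Random(seed)
    weight = 0.0  # toy actor: mu(s) = weight * s
    for _ in range(steps):
        state = rng.uniform(-1.0, 1.0)
        action = weight * state
        # Chain rule: dQ/dweight = dQ/da * da/dweight, with da/dweight = state.
        weight += learning_rate * dq_daction(state, action) * state
    return weight


actor_weight = train_linear_actor()
print(actor_weight)  # approaches 2.0, the weight of the argmax policy
```

In DDPG the critic is itself learned, so the actor chases a moving estimate of Q rather than a fixed function as in this sketch.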

Algorithm:

- initialize replay buffer $R$
- initialize a random process $N$ for action exploration (Ornstein-Uhlenbeck noise)
- set up agent
    - register replay buffer $R$ and random process $N$
    - randomly initialize actor network $\mu(s|\theta^\mu)$ with weights $\theta^\mu$
    - initialize actor target network $\mu'$ with weights $\theta^{\mu'} \leftarrow \theta^\mu$
    - randomly initialize critic network $Q(s,a|\theta^Q)$ with weights $\theta^Q$
    - initialize critic target network $Q'$ with weights $\theta^{Q'} \leftarrow \theta^Q$

- for episode = 1, max_number_of_episodes do:
    - reset random process $N$
    - receive initial observation state $s_1$
    - for t = 1, T do:
        - select action $a_t = \mu(s_t|\theta^\mu ) + N_t$ according to the current policy and exploration noise
        - execute action $a_t$ and observe reward $r_t$ and new state $s_{t+1}$
        - store transition $(s_t, a_t, r_t, s_{t+1})$ in $R$
        - sample a random minibatch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $R$ (here $N$ is the batch size, not the noise process)
        - set $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'})$
        - update critic by minimizing the loss
        - update the actor policy using the sampled policy gradient
        - soft update the target networks
    - end for
- end for
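
For reference, the critic loss and the sampled policy gradient used in the two update steps can be written out as follows. This follows the formulation in the DDPG paper, with $N$ denoting the minibatch size and $y_i$ the target defined above:

$$L = \frac{1}{N} \sum_i \left( y_i - Q(s_i, a_i|\theta^Q) \right)^2$$

$$\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q(s, a|\theta^Q)\big|_{s=s_i,\, a=\mu(s_i)} \, \nabla_{\theta^\mu} \mu(s|\theta^\mu)\big|_{s=s_i}$$

The critic is trained by regression towards $y_i$, while the actor is moved in the direction that increases the critic's estimate of its own actions.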

### 3. Hyperparameters
- hyperparameters can be found in the config file

buffer_size: 1000000 
- the number of experience tuples we can save to our experience replay buffer
- this value should be high to save as much experience as possible
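
A replay buffer with a fixed capacity can be sketched with a `deque`. This is a minimal illustration, not the project's implementation: the class name `ReplayBuffer`, its methods, and the `Experience` tuple are assumptions. Once `buffer_size` is reached, the oldest experience is dropped automatically.

```python
import random
from collections import deque, namedtuple

# Hypothetical container for one transition.
Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    def __init__(self, buffer_size=1_000_000, batch_size=124, seed=None):
        # deque(maxlen=...) discards the oldest entry when full.
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniform sampling without replacement from the stored experiences.
        return self.rng.sample(self.memory, self.batch_size)

    def __len__(self):
        return len(self.memory)


# Tiny demo: capacity 3, five transitions added, only the last three remain.
buffer = ReplayBuffer(buffer_size=3, batch_size=2, seed=0)
for step in range(5):
    buffer.add(state=step, action=0.0, reward=0.0, next_state=step + 1, done=False)
stored_states = [experience.state for experience in buffer.memory]
batch = buffer.sample()
```

Sampling only makes sense once the buffer holds at least `batch_size` experiences, so a training loop would typically check `len(buffer)` first.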

batch_size: 124
- number of (state, action, reward, next_state, done) tuples sampled from the experience buffer in each training step

n_inputs: 33 
- number of signals in the input vector

n_actions: 4 
- number of signals in the action vector

gamma: 0.99
- a discount factor for future rewards, meaning that rewards received now have more value than uncertain future rewards
- the value should be close to 1 because the target only takes one step into the future into account
- hyperparameter was chosen according to Part 7: Experiment Details in the "Continuous control with deep reinforcement learning" paper 
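
To make the role of gamma concrete, here is a small sketch of the one-step target $y_i = r_i + \gamma Q'(\cdot)$. The function name `td_targets` and the example numbers are made up for illustration. The `dones` mask is an assumption based on the batch tuples listed under batch_size; the target formula in the algorithm section has no such term.

```python
def td_targets(rewards, next_q_values, dones, gamma=0.99):
    # One-step bootstrap target; no bootstrapping from terminal states.
    return [
        reward + gamma * next_q * (0.0 if done else 1.0)
        for reward, next_q, done in zip(rewards, next_q_values, dones)
    ]


# Made-up minibatch of two transitions, the second one terminal.
targets = td_targets(
    rewards=[0.1, 0.0],
    next_q_values=[2.0, 5.0],
    dones=[False, True],
)

# Weight that gamma = 0.99 gives to a reward 100 steps in the future.
weight_after_100_steps = 0.99 ** 100
```

With gamma = 0.99, a reward 100 steps ahead still contributes roughly a third of its value, which is why a value close to 1 keeps long-term rewards relevant.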

tau: 0.001
- determines the step size of the soft update that moves the target network weights towards the local network weights
- the value should be close to 0 to get a more stable learning process
- hyperparameter was chosen according to Part 7: Experiment Details in the "Continuous control with deep reinforcement learning" paper 
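
The soft update blends a small fraction `tau` of the local weights into the target weights. The sketch below works on plain lists of floats to stay self-contained; the function name `soft_update` is an assumption, and a real implementation would iterate over the network parameters instead.

```python
def soft_update(local_weights, target_weights, tau=0.001):
    # theta_target <- tau * theta_local + (1 - tau) * theta_target
    return [
        tau * local + (1.0 - tau) * target
        for local, target in zip(local_weights, target_weights)
    ]


local = [1.0, -1.0]
target = [0.0, 0.0]

# A single update moves the target by only 0.1% of the gap.
after_one_update = soft_update(local, target)

# Repeating the update shows how slowly the target tracks the local network.
after_many_updates = target
for _ in range(1000):
    after_many_updates = soft_update(local, after_many_updates)
```

After 1000 updates with fixed local weights the target has closed only about 63% of the gap, which is what keeps the learning targets stable.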

learning_rate_actor: 0.0001
- the rate at which the actor network is updated (how big the weight update steps are)
    - large values lead to fast learning but will probably overshoot the optimum
    - small values might lead to very slow learning
    - hyperparameter was chosen according to Part 7: Experiment Details in the "Continuous control with deep reinforcement learning" paper 
    
fc_units_actor: 400, 300
- units for the fc layers in the actor-network
- hyperparameter was chosen according to Part 7: Experiment Details in the "Continuous control with deep reinforcement learning" paper 

learning_rate_critic: 0.0003
- the rate at which the critic network is updated (how big the weight update steps are)
    - large values lead to fast learning but will probably overshoot the optimum
    - small values might lead to very slow learning
    - hyperparameter is based on Part 7: Experiment Details in the "Continuous control with deep reinforcement learning" paper; note that the paper itself reports 0.001 for the critic, so the value used here is lower
    
fc_units_critic: [400, 300]
- units for the fc layers in the critic network
- hyperparameter was chosen according to Part 7: Experiment Details in the "Continuous control with deep reinforcement learning" paper 

l2_weight_decay: 0.01
- hyperparameter was chosen according to Part 7: Experiment Details in the "Continuous control with deep reinforcement learning" paper 

Ornstein-Uhlenbeck noise:

"mu": 0
- hyperparameter was chosen according to Part 7: Experiment Details in the "Continuous control with deep reinforcement learning" paper 

"theta": 0.15
- hyperparameter was chosen according to Part 7: Experiment Details in the "Continuous control with deep reinforcement learning" paper 

"sigma": 0.2
- hyperparameter was chosen according to Part 7: Experiment Details in the "Continuous control with deep reinforcement learning" paper 
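
How the three noise parameters interact can be seen in a small sketch of a discretized Ornstein-Uhlenbeck process. This is an illustration under assumptions, not the project's code: the class name, the implicit time step of 1, and the use of Gaussian increments are choices made here. `theta` pulls the state back towards `mu`, and `sigma` scales the random perturbation.

```python
import random


class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise: x <- x + theta * (mu - x) + sigma * N(0, 1)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=None):
        self.size = size
        self.mu = mu
        self.theta = theta
        self.sigma = sigma
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Corresponds to "reset random process N" at the start of an episode.
        self.state = [self.mu] * self.size

    def sample(self):
        # Mean-reverting step plus a scaled Gaussian perturbation per dimension.
        self.state = [
            x + self.theta * (self.mu - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)


# One noise value per action dimension (n_actions: 4).
noise = OrnsteinUhlenbeckNoise(size=4, seed=0)
first_noise = noise.sample()

# With sigma = 0 only the pull towards mu remains: 1.0 -> 0.85.
quiet = OrnsteinUhlenbeckNoise(size=1, sigma=0.0)
quiet.state = [1.0]
decayed = quiet.sample()[0]
```

Because each sample depends on the previous one, consecutive actions are perturbed in a similar direction, which tends to explore better in physical control tasks than independent noise.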


### 4. Network architectures

Critic + Critic_Target:

DDPGCritic(<br>
&nbsp;&nbsp;&nbsp;&nbsp;(state_head): Sequential(<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(0): Linear(in_features=33, out_features=400, bias=True)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(1): ReLU()<br>
&nbsp;&nbsp;&nbsp;&nbsp;)<br>
&nbsp;&nbsp;&nbsp;&nbsp;(state_action_body): Sequential(<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(0): Linear(in_features=404, out_features=300, bias=True)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(1): ReLU()<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(2): Linear(in_features=300, out_features=1, bias=True)<br>
&nbsp;&nbsp;&nbsp;&nbsp;)<br>
)<br>

Actor + Actor_Target:

DDPGActor(<br>
&nbsp;&nbsp;&nbsp;&nbsp;(network): Sequential(<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(0): Linear(in_features=33, out_features=400, bias=True)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(1): ReLU()<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(2): Linear(in_features=400, out_features=300, bias=True)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(3): ReLU()<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(4): Linear(in_features=300, out_features=4, bias=True)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(5): Tanh()<br>
&nbsp;&nbsp;&nbsp;&nbsp;)<br>
)<br>
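
The layer sizes above can be checked with a short calculation of the number of trainable parameters. The helper `linear_parameters` is introduced here only for illustration. The critic's second layer has 404 inputs, which is consistent with the 400 outputs of the state head being concatenated with the 4 action values; this reading is inferred from the printed shapes, not from the project's code.

```python
def linear_parameters(in_features, out_features, bias=True):
    # Weights plus (optionally) one bias per output unit.
    return in_features * out_features + (out_features if bias else 0)


n_inputs, n_actions = 33, 4

# (in_features, out_features) for each Linear layer as printed above.
critic_layers = [(n_inputs, 400), (400 + n_actions, 300), (300, 1)]
actor_layers = [(n_inputs, 400), (400, 300), (300, n_actions)]

critic_total = sum(linear_parameters(i, o) for i, o in critic_layers)
actor_total = sum(linear_parameters(i, o) for i, o in actor_layers)

print("critic parameters:", critic_total)
print("actor parameters:", actor_total)
```

Both networks are of similar size, and each has a target copy with the same number of parameters.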

### 5. Results

Trained Agent:

[image2]: https://raw.githubusercontent.com/cpow-89/Deep_Reinforcement_Learning_Nanodegree_Project_2_Continuous_Control/master/images/trained_agent.gif?token=Amwnwv58uwb_JY6Z0p0_vJrWmnnl-0Eeks5bxNVywA%3D%3D "Trained Agent"
![Trained Agent][image2]


### 6. Ideas for Future Work
- add Prioritized Experience Replay and use the weight initialization suggested in the original DDPG paper
    - should lead to faster and more stable learning