# Report

## Policy-Based Methods
Whereas *Value-Based Methods* like Deep Q-Learning obtain an optimal policy $\pi_*$ by estimating the optimal action-value function, *Policy-Based Methods* learn the optimal policy directly.  
Besides this simplification, another advantage of Policy-Based Methods is that they can handle stochastic policies as well as continuous actions.  
Policy-Based Methods typically use the *Monte Carlo* (MC) approach to estimate the expected return:

$G_t = R_{t+1} + R_{t+2} + \dots + R_T$, if the discount factor $\gamma=1$

Because $G_t$ is computed from the full trajectory, this estimate has high *variance* but low *bias*.  
Value-Based Methods, in contrast, typically use the *Temporal Difference* (TD) approach to estimate the return:
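
As a minimal sketch of the MC estimate, the returns of one finished episode can be computed by summing the rewards backwards. The function name `monte_carlo_returns` and the list-of-rewards layout are illustrative assumptions, not part of any particular library.

```python
def monte_carlo_returns(rewards, gamma=1.0):
    """Return [G_0, ..., G_{T-1}] for one finished episode.

    rewards[t] is assumed to hold R_{t+1}, the reward received after step t.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    # Walk backwards so each G_t reuses G_{t+1}: G_t = R_{t+1} + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


print(monte_carlo_returns([1.0, 2.0, 3.0]))  # [6.0, 5.0, 3.0]
```

Every $G_t$ depends on all later rewards of the episode, which is where the high variance comes from.
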

$G_t = R_{t+1} + G_{t+1}$, if $\gamma=1$

Here $G_{t+1}$ is the estimated total return the agent will obtain from the next state onward. Because the estimate of $G_t$ always depends on the estimate for the next state, these estimates have low variance but are biased.  
The strengths of both approaches can be combined in a single algorithm, namely the Actor-Critic Method.
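
The TD estimate can be sketched in a few lines for a tabular value function. The names `td_target` and `td0_update`, the dictionary of state values, and the step size `alpha` are assumptions made for illustration only.

```python
def td_target(reward, next_value, gamma=1.0, done=False):
    """One-step TD estimate of G_t: R_{t+1} + gamma * V(s')."""
    # A terminal next state has no future return, so nothing is bootstrapped.
    return reward + (0.0 if done else gamma * next_value)


def td0_update(values, state, reward, next_state, alpha=0.1, gamma=1.0, done=False):
    """Move values[state] a step of size alpha toward the TD target (in place)."""
    target = td_target(reward, values[next_state], gamma, done)
    values[state] += alpha * (target - values[state])
    return target


values = {"A": 0.0, "B": 2.0}
td0_update(values, "A", reward=1.0, next_state="B", alpha=0.5)
print(values["A"])  # 1.5
```

Only one real reward enters the target; the rest comes from the current estimate for the next state, which keeps the variance low but makes the target biased while that estimate is still inaccurate.
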

## Actor-Critic Methods
In Actor-Critic Methods one uses two function approximators (usually neural networks) to learn a policy (Actor) and a value function (Critic). The process looks as follows:  

1) Observe state $s$ from the environment and feed it into the Actor.  
2) The Actor outputs action probabilities $\pi(a|s;\theta_\pi)$. Sample one action $a$ from them and feed it back to the environment.  
3) Observe next state $s'$ and reward $r$.  
4) Use the tuple $(s, a, r, s')$ for the TD estimate $r + \gamma V(s'; \theta_v)$ to train the Critic.  
5) Calculate the advantage $A(s,a) = r + \gamma V(s'; \theta_v) - V(s; \theta_v)$.  
6) Train the Actor using the advantage.
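
The steps above can be sketched as follows. This is a deliberately simplified version: instead of neural networks it assumes a tabular softmax Actor (`theta[s][a]` are action preferences) and a tabular Critic (`v[s]`), and all function names, learning rates, and the tiny two-state usage example are hypothetical.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(theta, s):
    """Steps 1-2: compute pi(a|s) and sample one action from it."""
    probs = softmax(theta[s])
    return random.choices(range(len(probs)), weights=probs)[0]


def actor_critic_step(theta, v, s, a, r, s_next, done,
                      gamma=0.99, lr_actor=0.1, lr_critic=0.1):
    """Steps 4-6 for one tuple (s, a, r, s'); mutates theta and v in place."""
    # Step 4: TD estimate r + gamma * V(s'); no bootstrap from a terminal state.
    target = r + (0.0 if done else gamma * v[s_next])
    # Step 5: advantage A(s, a) = r + gamma * V(s') - V(s).
    advantage = target - v[s]
    # Train the Critic: move V(s) toward the TD estimate.
    v[s] += lr_critic * advantage
    # Step 6: train the Actor along advantage * grad log pi(a|s).
    # For a softmax policy: d log pi(a|s) / d theta[s][b] = 1{a == b} - pi(b|s).
    probs = softmax(theta[s])
    for b in range(len(theta[s])):
        grad_log_pi = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += lr_actor * advantage * grad_log_pi
    return advantage


theta = [[0.0, 0.0], [0.0, 0.0]]  # Actor: action preferences per state
v = [0.0, 0.0]                    # Critic: state values
adv = actor_critic_step(theta, v, s=0, a=1, r=1.0, s_next=1, done=True)
```

A positive advantage raises the probability of the action that was taken, and a negative one lowers it. With neural networks the same two updates would be expressed as a Critic loss and an Actor loss instead of explicit table updates.
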

## Deep Deterministic Policy Gradient
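Deep Deterministic Policy Gradient (DDPG) combines the Actor-Critic approach with Deep Q-Learning (Lillicrap et al., 2016). The actor function $\mu(s|\theta_\mu)$ specifies the current policy by deterministically mapping states to continuous actions. The critic $Q(s,a)$, on the other hand, estimates the action value of the action chosen by the actor and is learned with the Bellman equation, as in Q-Learning.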

## References
- Lillicrap, T., Hunt, J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D., Continuous Control with Deep Reinforcement Learning, arXiv:1509.02971v5 [cs.LG], 29 Feb 2016