# Project 2: Continuous Control

### Algorithm (DDPG)

1. Theory:

The actor learns a deterministic policy $\mu(s; \theta_{\mu})$, as opposed to a stochastic one: for a given state it outputs the single action with the highest state-action value estimate from the critic. The actor parameters are therefore chosen to maximize that estimate over states drawn from the replay memory:

$$\arg\max_{\theta_{\mu}}E_{s \sim D}[Q(s, \mu(s; \theta_{\mu}); \theta_{Q})]$$

where $D=\{(s, a, r, s^{'}, d)_{0}, ..., (s, a, r, s^{'}, d)_{t}, ..., (s, a, r, s^{'}, d)_{T}\}$ is the replay memory buffer of stored transitions.

The critic, on the other hand, learns to approximate the optimal action-value function $Q^{*}(s, a)$ with $Q(s, a; \theta_{Q})$, where $a = \mu(s; \theta_{\mu})$ is the action the deterministic policy selects in state $s$. For stability, the regression target $y_{t}$ is computed with separate target networks, and the critic minimizes the squared error against it:

$$\arg\min_{\theta_{Q}}E_{S,A \sim D}[(Q(S_{t}, A_{t}; \theta_{Q}) - y_{t})^{2}]$$

where $y_{t} = r(S_{t}, A_{t}) + \gamma * Q(S_{t + 1}, \mu(S_{t + 1}; \theta_{\mu}^{target}); \theta_{Q}^{target})$, and $\theta_{\mu}^{target}$ and $\theta_{Q}^{target}$ are the parameters of the target actor and the target critic.

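The target networks are soft-updated for added stability. They play the same role as the target network in DQN, but track the learned networks gradually through an interpolation factor $\tau$: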

$$\theta_{\mu}^{target} = \tau * \theta_{\mu} + (1 - \tau) * \theta_{\mu}^{target}$$
$$\theta_{Q}^{target} = \tau * \theta_{Q} + (1 - \tau) * \theta_{Q}^{target}$$
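
The soft update above can be sketched in a few lines. This is a minimal illustration on plain Python lists rather than network weights; the function name `soft_update` and the value `tau=0.1` are assumptions chosen for the example, not values taken from this project.

```python
def soft_update(target_params, local_params, tau):
    """Blend local parameters into the target parameters, in place.

    Implements theta_target = tau * theta_local + (1 - tau) * theta_target
    element by element.
    """
    for i, (target, local) in enumerate(zip(target_params, local_params)):
        target_params[i] = tau * local + (1.0 - tau) * target
    return target_params


# Hypothetical parameter vectors, for illustration only.
target = [0.0, 1.0]
local = [1.0, 3.0]
soft_update(target, local, tau=0.1)
```

With a small `tau` the target parameters move only a fraction of the way toward the learned ones on each call, which is what keeps the regression target for the critic slowly varying.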



2. Algorithm:

Input: $\theta_{Q}^{initial}$, $\theta_{\mu}^{initial}$, empty replay memory $D$

Set target parameters $\theta_{Q}^{target} = \theta_{Q}$, $\theta_{\mu}^{target} = \theta_{\mu}$

**Repeat**

Observe state $s$ and choose action $a = clip(\mu(s; \theta_{\mu}) + \epsilon, -1, 1)$ where $\epsilon \sim N(0, 1)$ is exploration noise

Execute action $a$

Observe next state $s^{'}$, reward $r$ and *done* signal $d$ (true if $s^{'}$ is terminal)

Store $(s, a, r, s^{'}, d)$ in $D$

**if** $d$ is true **then** reset the environment

**if** it is time to update **then**

**for** however many updates **do**

Sample, at random, from $D$, $B = \{(s, a, r, s^{'}, d)\}$

Compute each $y_{t} = r(S_{t}, A_{t}) + \gamma * Q(S_{t + 1}, \mu(S_{t + 1}; \theta_{\mu}^{target}); \theta_{Q}^{target})$

Update critic by one step of gradient descent using $\nabla_{\theta_{Q}}\frac{1}{|B|}\sum_{(S_{t}, A_{t}, R_{t + 1}, S_{t + 1}, d_{t + 1}) \in B}(Q(S_{t}, A_{t}; \theta_{Q}) - y_{t})^{2}$

Update actor by one step of gradient ascent using $\nabla_{\theta_{\mu}}\frac{1}{|B|}\sum_{S_{t} \in B}Q(S_{t}, \mu(S_{t}; \theta_{\mu}); \theta_{Q})$

Soft-update with

$\theta_{\mu}^{target} = \tau * \theta_{\mu} + (1 - \tau) * \theta_{\mu}^{target}$

$\theta_{Q}^{target} = \tau * \theta_{Q} + (1 - \tau) * \theta_{Q}^{target}$

**end for**

**end if**
    
**until** convergence
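
The data-handling steps of the loop above (noisy action, storing transitions, random sampling, computing $y_{t}$) can be sketched without any deep learning library. This is only an illustration under stated assumptions: the names `ReplayBuffer`, `noisy_action` and `td_targets` are hypothetical, the actions are scalars, the lambdas stand in for the target networks, and the `(1 - d)` mask on the bootstrap term is a common convention that does not appear in the formula above.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["s", "a", "r", "s_next", "d"])


class ReplayBuffer:
    """Fixed-size memory D; samples transitions uniformly at random."""

    def __init__(self, memory_size, seed=1):
        self.memory = deque(maxlen=memory_size)  # oldest entries are dropped first
        self.rng = random.Random(seed)

    def store(self, s, a, r, s_next, d):
        self.memory.append(Transition(s, a, r, s_next, d))

    def sample(self, batch_size):
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def noisy_action(mu, s, rng, low=-1.0, high=1.0):
    """a = clip(mu(s) + eps, low, high) with eps ~ N(0, 1), scalar action."""
    return max(low, min(high, mu(s) + rng.gauss(0.0, 1.0)))


def td_targets(batch, mu_target, q_target, gamma=0.99):
    """y = r + gamma * (1 - d) * Q_target(s', mu_target(s')).

    Assumption: the (1 - d) factor stops bootstrapping from terminal states.
    """
    return [
        t.r + gamma * (1.0 - float(t.d)) * q_target(t.s_next, mu_target(t.s_next))
        for t in batch
    ]


# Hypothetical stand-ins for the target actor and target critic.
mu_target = lambda s: 0.5 * s
q_target = lambda s, a: s + a

buffer = ReplayBuffer(memory_size=100, seed=1)
rng = random.Random(1)
s = 0.0
for step in range(20):
    a = noisy_action(lambda state: 0.0, s, rng)
    s_next, r, d = s + a, 1.0, step % 10 == 9  # toy dynamics, episode ends every 10 steps
    buffer.store(s, a, r, s_next, d)
    s = 0.0 if d else s_next  # reset the environment when the episode ends

batch = buffer.sample(8)
targets = td_targets(batch, mu_target, q_target)
```

The gradient steps for the critic and the actor are left out on purpose, since they depend on the network library in use.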

3. Parameters:

$seed = 1$

$BatchSize = 128$

$MemorySize = 1e5$

$\gamma = 0.99$

$lr_{actor} = 1e-3$

$lr_{critic} = 1e-3$
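
One way to keep these values together in code is a small frozen dataclass. This is a sketch only: the class name `DDPGConfig` and the field names are hypothetical, and only the values listed above are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DDPGConfig:
    """Hyperparameters listed above; field names are illustrative."""

    seed: int = 1
    batch_size: int = 128
    memory_size: int = int(1e5)
    gamma: float = 0.99
    lr_actor: float = 1e-3
    lr_critic: float = 1e-3


config = DDPGConfig()
```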


### Architecture

Actor: 3 fully connected layers (`nn.Linear`) with a tanh activation on the output to bound the output between -1 and 1. There is one batch normalization layer after the first fully connected one (`nn.BatchNorm1d`).

![Actor architecture](imgs/actor_architecture.png)

Critic: 3 fully connected layers where the second fully connected layer accepts as input (output_fc1 + action). There is one batch normalization layer after the first fully connected one.

![Critic architecture](imgs/critic_architecture.png)

All other activation functions are `F.leaky_relu`, in order to avoid dead neurons when learning.
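
The two networks described above might look roughly like the following PyTorch sketch. Several details are assumptions and not taken from this project: the hidden sizes (`fc1_units=128`, `fc2_units=64`), applying batch normalization before the activation, and reading "output_fc1 + action" as a concatenation along the feature dimension.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Maps a state to an action bounded in [-1, 1]."""

    def __init__(self, state_size, action_size, fc1_units=128, fc2_units=64):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.bn1 = nn.BatchNorm1d(fc1_units)  # batch norm after the first layer
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.fc3 = nn.Linear(fc2_units, action_size)

    def forward(self, state):
        x = F.leaky_relu(self.bn1(self.fc1(state)))
        x = F.leaky_relu(self.fc2(x))
        return torch.tanh(self.fc3(x))  # tanh bounds the output


class Critic(nn.Module):
    """Maps a (state, action) pair to a scalar Q-value."""

    def __init__(self, state_size, action_size, fc1_units=128, fc2_units=64):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.bn1 = nn.BatchNorm1d(fc1_units)
        # Assumption: the action is concatenated to the first layer's output.
        self.fc2 = nn.Linear(fc1_units + action_size, fc2_units)
        self.fc3 = nn.Linear(fc2_units, 1)

    def forward(self, state, action):
        xs = F.leaky_relu(self.bn1(self.fc1(state)))
        x = torch.cat((xs, action), dim=1)
        x = F.leaky_relu(self.fc2(x))
        return self.fc3(x)  # no activation: Q-values are unbounded


# Hypothetical sizes, for illustration only.
states = torch.randn(4, 8)
actor, critic = Actor(8, 2), Critic(8, 2)
actions = actor(states)
q_values = critic(states, actions)
```

Note that `nn.BatchNorm1d` in training mode needs more than one sample per batch, so a single state should be passed through the actor in evaluation mode (`actor.eval()`).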

### Training results

![Training scores](imgs/plot_scores.png)

### Ideas for future work

Given the parallel-agent environment, I would like to try out `A3C` or `D4PG`. Both are designed to collect experience from several agents at once, so I believe they would make better use of this environment than a single shared learner.

I would also like to try Trust Region Policy Optimization (TRPO), as it seems a good fit for the continuous action space.

Another improvement I could introduce would be to prioritize the replay memory, as in prioritized experience replay. Not all transitions contribute equally to the learning process, so sampling the more informative ones more often could speed up training.