# Twin Delayed Deep Deterministic Policy Gradient (TD3)

- extended actor-critic method for continuous state and action spaces
- off-policy algorithm
- the model uses one actor network to calculate the actions given the states
- the model uses two critic networks (the twins) to calculate the Q-values given the states and actions

### Initial Setup:

Actor:
- build one actor network and one actor target network (both share the same architecture)

![ActorModel](img/actor_model.jpg)

Critic Twins:
- build two critic networks (the twins) and two critic target networks (all sharing the same architecture)

![CriticModel](img/critic_model.jpg)

Full Model:
- in total, the full model consists of 6 neural networks: the actor, the two critics, and one target network for each of them

![FullModel](img/full_model.jpg)

Replay Buffer:
- initialize an experience replay buffer (we need a memory buffer because TD3 is an off-policy algorithm that learns from previously collected transitions)
- the buffer stores the last n transitions
- each transition consists of state, next state, action, reward and done information
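
A minimal sketch of such a buffer, using only the Python standard library. The class name `ReplayBuffer` and its methods `add` and `sample` are hypothetical names chosen for illustration; states and actions are assumed to be plain lists of floats.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size FIFO memory of transitions (illustrative helper, not a library class)."""

    def __init__(self, capacity):
        # a deque with maxlen drops the oldest transition once the buffer is full
        self.storage = deque(maxlen=capacity)

    def add(self, state, next_state, action, reward, done):
        # store the done flag as a float so it can be used in arithmetic later
        self.storage.append((state, next_state, action, reward, float(done)))

    def sample(self, batch_size):
        # uniform sampling without replacement within one batch
        batch = random.sample(list(self.storage), batch_size)
        # regroup the list of transitions into one tuple per field
        states, next_states, actions, rewards, dones = zip(*batch)
        return states, next_states, actions, rewards, dones

    def __len__(self):
        return len(self.storage)
```

With `capacity=3`, adding five transitions leaves only the last three in memory, which matches the "last n transitions" behaviour described above.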

### Algorithm

Exploration Phase:
- fill the replay buffer with n transitions based on random actions

Training Phase:

Step 1:
- sample a batch of transitions (state ($s$), next state ($s'$), action ($a$), reward ($r$) and done information) from the replay buffer

For each transition in the batch:

Step 2:
- use the next state ($s'$) to calculate the next action ($a'$) using the actor target network

$a' = \pi_{\phi'}(s')$

Step 3:
- add noise ($\epsilon$) from a Gaussian distribution ($\mathcal{N}$) to the next action ($a'$) calculated in step 2 (this smooths the target value over nearby actions, known as target policy smoothing, and acts as a regularizer rather than as exploration)

$a' \leftarrow a' + \epsilon$

$\epsilon = \text{clip}(\mathcal{N}(0, \tilde\sigma), -c, c)$

- clamp the noisy next action ($a'$) to the min/max action range supported by the environment

$a' \leftarrow \text{clip}(a', a_{min}, a_{max})$
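
Steps 2 and 3 can be sketched in plain Python as follows. The function names and the default values for `sigma`, `noise_clip` and `max_action` are illustrative assumptions, and the action range is assumed to be symmetric, i.e. `[-max_action, max_action]`.

```python
import random


def clip(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(value, high))


def smoothed_target_action(target_action, sigma=0.2, noise_clip=0.5, max_action=1.0):
    """Add clipped Gaussian noise to each action dimension, then clamp to the action range."""
    noisy = []
    for a in target_action:
        # epsilon = clip(N(0, sigma), -c, c)
        eps = clip(random.gauss(0.0, sigma), -noise_clip, noise_clip)
        # a' <- clip(a' + epsilon, -max_action, max_action)
        noisy.append(clip(a + eps, -max_action, max_action))
    return noisy
```

Here `target_action` stands for the output of the actor target network for one next state. Clipping the noise first keeps the smoothed action close to the original one; the second clip keeps it valid for the environment.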

Step 4:
- use the next state ($s'$) and the noisy, clamped next action ($a'$) as input for the two critic target networks to calculate two Q-values $Q_1(s', a')$ and $Q_2(s', a')$

Step 5:
- calculate the minimum of $Q_1(s', a')$ and $Q_2(s', a')$
- $Q_{min}$ is an approximation of the value of the next state
- by taking the minimum, we counteract overly optimistic value estimates (overestimation bias), which is a common problem in previous models

$Q_{min} = \min(Q_1(s', a'), Q_2(s', a'))$

Steps 2 to 5 visualized:

![FullModel](img/forward_pass.jpg)

Step 6:
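- calculate the target Q-value for the two critic models based on $Q_{min}$

$Q_t = r + \gamma (1 - d) Q_{min}$

- $\gamma$ is the discount factor
- $d$ is the done information of the transition (1 if $s'$ is terminal, otherwise 0), so terminal transitions use only the reward as target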
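
As a worked example of Steps 5 and 6, the target for a single transition can be computed like this. The function name `td3_target` and the default `gamma` are assumptions for illustration; the done flag is used to drop the bootstrap term for terminal transitions.

```python
def td3_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q target for one transition."""
    # take the more pessimistic of the two target critic estimates
    q_min = min(q1_next, q2_next)
    # (1 - done) removes the bootstrap term when the next state is terminal
    return reward + gamma * (1.0 - done) * q_min
```

For example, with `reward=1.0`, `done=0.0`, `q1_next=5.0`, `q2_next=3.0` and `gamma=0.9`, the target is `1.0 + 0.9 * 3.0 = 3.7`. With `done=1.0` the target reduces to the reward alone.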

Step 7:
- use the state ($s$) and the action ($a$) from the sampled transition as input for the two critic models to calculate two Q-values $Q_1(s, a)$ and $Q_2(s, a)$

Step 8:
- calculate the loss for the two critic models based on $Q_1(s, a)$, $Q_2(s, a)$ and $Q_t$

$Loss_{Critic1} = MSELoss(Q_1, Q_t)$

$Loss_{Critic2} = MSELoss(Q_2, Q_t)$

Step 9:
- update the weights of the two critic models based on $Loss_{Critic1}$ and $Loss_{Critic2}$ using backpropagation
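
The loss computation from Step 8 can be sketched without a deep learning framework. Both helper names are hypothetical, and summing the two losses into one scalar before backpropagation is a common choice that is assumed here, not something the steps above prescribe.

```python
def mse_loss(predictions, targets):
    """Mean squared error over a batch of scalar values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def critic_loss(q1_values, q2_values, q_targets):
    """Sum of both critics' MSE losses against the shared target Q_t."""
    return mse_loss(q1_values, q_targets) + mse_loss(q2_values, q_targets)
```

Both critics are trained towards the same target $Q_t$; they end up with different weights only because they start from different initializations.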

Every n iterations (this is where the "delayed" in TD3 comes from):

Step 10:
- update the weights of the actor model by performing gradient ascent on the output of the first critic model
- this moves the actor parameters in the direction that increases the Q-values, and with them the expected return

$\nabla_\phi J(\phi) = N^{-1}\sum\nabla_a Q_{\theta_1}(s,a)|_{a=\pi_\phi(s)} \nabla_\phi \pi_\phi(s)$

$\phi$ = actor weights, $\theta_1$ = weights of the first critic, $N$ = batch size
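
To make the chain rule in this gradient concrete, here is a toy one-dimensional example. Everything in it is an assumption made for illustration: the actor is the linear function $\pi_\phi(s) = \phi s$ and the critic is the fixed function $Q(s, a) = -(a - 2s)^2$, whose best action is $a = 2s$. Real TD3 uses neural networks and automatic differentiation instead.

```python
def actor_gradient(phi, states):
    """Deterministic policy gradient for the toy actor and critic described above."""
    total = 0.0
    for s in states:
        a = phi * s                    # a = pi_phi(s)
        dq_da = -2.0 * (a - 2.0 * s)   # gradient of Q with respect to the action
        dpi_dphi = s                   # gradient of pi_phi(s) with respect to phi
        total += dq_da * dpi_dphi      # chain rule for one sample
    return total / len(states)         # average over the batch (the N^{-1} sum)


phi = 0.0
states = [0.5, -1.0, 1.5]
for _ in range(200):
    # gradient ascent: move phi in the direction that increases Q
    phi += 0.1 * actor_gradient(phi, states)
```

After the loop, `phi` is very close to 2, the value for which the toy actor always picks the action the toy critic rates highest.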

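Step 11:
- smoothly update the weights of the actor target using Polyak averaging

$\phi' \leftarrow \tau\phi + (1-\tau) \phi'$

Step 12:
- smoothly update the weights of the two critic targets using Polyak averaging

$\theta_i' \leftarrow \tau\theta_i + (1-\tau) \theta_i'$ for $i = 1, 2$

- $\tau$ is a small interpolation factor between 0 and 1, so the target networks follow the trained networks slowly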
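
The Polyak update and the delayed schedule can be sketched together as follows. Weights are represented as flat lists of floats, and the names `polyak_update`, `is_delayed_update_step`, `tau` and `policy_delay`, as well as their default values, are illustrative assumptions.

```python
def polyak_update(online, target, tau=0.005):
    """Return new target weights: tau * online + (1 - tau) * target, per weight."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


def is_delayed_update_step(iteration, policy_delay=2):
    """True on every policy_delay-th iteration (iterations counted from 1)."""
    return iteration % policy_delay == 0


online_weights = [1.0, -2.0]
target_weights = [0.0, 0.0]
for iteration in range(1, 5):
    # the critics would be trained here on every iteration (Steps 1 to 9)
    if is_delayed_update_step(iteration):
        # the actor update (Step 10) would happen here, followed by the target updates
        target_weights = polyak_update(online_weights, target_weights, tau=0.5)
```

With the exaggerated `tau=0.5` used in the loop, the target weights are updated on iterations 2 and 4 and move halfway towards the online weights each time. A much smaller `tau` makes the targets change slowly, which keeps the critic targets stable.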