# T3D Implementation

---
* Twin Delayed Deep Deterministic Policy Gradient
  * A **policy** maps each state to a probability distribution over **actions**.
  * The *policy* is what the agent controls.
    * When the agent follows a policy, it generates a sequence of states, actions and rewards.
    * This sequence is called a *trajectory*.

  * **Policy Gradient**
    * The objective of the **reinforcement learning agent** is to maximize the *(discounted) reward (from the start state)* when following the *policy*.
    * In an ML setup, we define a set of parameters ($\theta$) to parameterize the *policy*.
    * The objective is to *maximize* the "expected" reward obtained by following the parameterized policy.
    * There is at least one optimal policy that gives the *maximum* reward.
    * Among all optimal policies, at least one is **stationary and deterministic**.
    * In ML, to maximize an objective, we perform **Gradient Ascent**.

  * **Deterministic Policy Gradient**
    * Learn a *deterministic action* for a given state.
    * Use **Actor-Critic Model**

  * **Deep Deterministic Policy Gradient (DDPG)**
    * The Actor and Critic are DNNs.

  * Architecture:
    * For stability, we keep **two** sets of networks:
      * Model - Model (the networks being trained)
      * Model - Target (slowly updated copies used to compute the targets)
    * Each set contains an actor and two critics:
      * Model - Actor
      * Model - Critic1 & Critic2 (two critics, hence TWIN in the algorithm's name).

  * **Delayed**
    * The critics of the model - *Model* are updated at every step, but the actor and the model - *Target* networks are updated only once every two steps.

  * **Twin**
    * Two critics instead of a single critic in the Actor-Critic model.
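
To make the "(discounted) reward" objective above concrete, here is a minimal sketch that computes the discounted return of a single trajectory from its rewards. The function name `discounted_return`, the reward values and the discount factor are illustrative assumptions, not part of the implementation further down.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma**t * rewards[t] for one trajectory."""
    total = 0.0
    # iterate backwards so that each step computes G_t = r_t + gamma * G_{t+1}
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# hypothetical rewards collected along one trajectory
rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.9))
```

The agent's objective is the expectation of this quantity over the trajectories generated by the policy.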


  



## References

1. [Deterministic Policy Gradient Algorithms - David Silver](http://proceedings.mlr.press/v32/silver14.pdf)



## Generic Reinforcement Learning Algorithm

---
```
Loop:
    Collect trajectories (transitions - (state, action, reward, next state, terminated flag))
    (Optionally) store trajectories in a replay buffer for sampling
    Loop:
        Sample a mini batch of transitions
        Compute Policy Gradient
        (Optionally) Compute Critic Gradient
        Update parameters
```
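
The loop above can be sketched in plain Python as follows. This is only a skeleton under stated assumptions: `ToyEnv`, `collect` and the random placeholder policy are hypothetical stand-ins, and the parameter update is left as a comment because it depends on the algorithm.

```python
import random
from collections import deque

class ToyEnv:
    """Hypothetical 1-D environment: the state is a number, the goal is to stay near 0."""
    def reset(self):
        self.state = random.uniform(-1.0, 1.0)
        self.steps = 0
        return self.state

    def step(self, action):
        self.state += action
        self.steps += 1
        reward = -abs(self.state)   # closer to 0 is better
        done = self.steps >= 10     # fixed-length episodes
        return self.state, reward, done

def collect(env, policy, buffer):
    """Run one episode and store every transition in the buffer."""
    state, done = env.reset(), False
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        buffer.append((state, action, reward, next_state, done))
        state = next_state

random.seed(0)
env = ToyEnv()
buffer = deque(maxlen=1000)                        # replay buffer
policy = lambda state: random.uniform(-0.1, 0.1)   # placeholder random policy

for episode in range(5):            # outer loop: collect trajectories
    collect(env, policy, buffer)
    for update in range(3):         # inner loop: sample and update
        batch = random.sample(list(buffer), k=4)
        # a real agent would compute the policy (and critic) gradients
        # from `batch` here and update its parameters
```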


## T3D Implementations

### Imports

In [0]:
# general imports
import os
import time
import random
import numpy as np
import matplotlib.pyplot as plt

# gym
import gym
from gym import wrappers

#
#import pybullet_envs

# torch
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
from collections import deque

### Step1: Experience Replay Memory

In [0]:
#
# Create a cyclic buffer with a capacity of 1 million
# to store transitions.
#
class ReplayBuffer(object):
  def __init__(self, max_size=int(1e6)):
    self.storage = []
    self.max_size = int(max_size)
    self.ptr = 0 # index of the next slot to overwrite once the buffer is full

  # add a transition to the replay buffer
  def add(self, transition):
    if len(self.storage) == self.max_size:
      # buffer full: overwrite the oldest transition and advance the pointer
      self.storage[self.ptr] = transition
      self.ptr = (self.ptr + 1) % self.max_size
    else:
      # don't use ptr here; use append instead,
      # as memory is allocated on the fly.
      self.storage.append(transition)

  # sample transitions from the buffer
  def sample(self, batch_size):
    # draw random indices (with replacement) of the transitions
    ind = np.random.randint(0, len(self.storage), batch_size)

    # get the transitions for the batch
    batch_states, batch_next_states, batch_actions, batch_rewards, batch_dones = [], [], [], [], []
    for i in ind:
      # unpack
      state, next_state, action, reward, done = self.storage[i]
      # np.asarray avoids a copy when the input is already an array
      batch_states.append(np.asarray(state))
      batch_next_states.append(np.asarray(next_state))
      batch_actions.append(np.asarray(action))
      batch_rewards.append(np.asarray(reward))
      batch_dones.append(np.asarray(done))

    # return only after the whole batch has been collected.
    # rewards and dones are reshaped to column vectors of shape (batch_size, 1)
    # so that each row pairs with the corresponding row of the critic output.
    return np.array(batch_states), \
           np.array(batch_next_states), \
           np.array(batch_actions), \
           np.array(batch_rewards).reshape(-1,1), \
           np.array(batch_dones).reshape(-1,1)
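
The cyclic overwrite is easier to see with a tiny capacity. The following minimal sketch repeats the same `add` logic in a hypothetical `TinyRingBuffer` class (an illustrative name, not part of the notebook) and stores five items in a buffer of size three.

```python
class TinyRingBuffer:
    """Same add() logic as a cyclic replay buffer, with a tiny capacity."""
    def __init__(self, max_size):
        self.storage = []
        self.max_size = max_size
        self.ptr = 0

    def add(self, item):
        if len(self.storage) == self.max_size:
            self.storage[self.ptr] = item               # overwrite the oldest entry
            self.ptr = (self.ptr + 1) % self.max_size   # wrap around at the end
        else:
            self.storage.append(item)                   # still filling up

buf = TinyRingBuffer(max_size=3)
for item in ["t0", "t1", "t2", "t3", "t4"]:
    buf.add(item)

print(buf.storage)  # ['t3', 't4', 't2']
```

The two oldest items (`t0` and `t1`) have been replaced, and the pointer now sits on `t2`, the next one to be overwritten.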




### Step2: Build DNN for Actor

Note that the Actor Model and the Actor Target are DNNs with the same architecture.


In [0]:
# input: state
# output: action
class Actor(nn.Module):
  def __init__(self, state_dims, action_dim, max_action):
    # max_action scales the tanh output so that actions lie in [-max_action, max_action]
    super(Actor, self).__init__() # init the inherited base class
    # build the layers of NN
    self.layer_1 = nn.Linear(state_dims, 400)
    self.layer_2 = nn.Linear(400, 300)
    self.layer_3 = nn.Linear(300, action_dim)
    self.max_action = max_action

  def forward(self, x):
    x = F.relu(self.layer_1(x))
    x = F.relu(self.layer_2(x))
    x = self.max_action * torch.tanh(self.layer_3(x))
    return x
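
The last line of `forward` bounds the action: `tanh` squashes any value into [-1, 1], and multiplying by `max_action` rescales it to the action range. A minimal scalar sketch, assuming a hypothetical bound of `max_action = 2.0` and the illustrative helper name `scale_action`:

```python
import math

def scale_action(pre_activation, max_action):
    """Squash an unbounded value into [-max_action, max_action]."""
    return max_action * math.tanh(pre_activation)

max_action = 2.0  # assumed action bound, for illustration only
for z in (-100.0, -1.0, 0.0, 1.0, 100.0):
    print(z, scale_action(z, max_action))
```

However large the output of the last linear layer is, the resulting action never leaves the allowed range.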


### Step3: Build Critic Model

Note that there are TWIN critics.

In [0]:
# TwoCritics
#
# input: state & action
# output: q-value
#
class Critic(nn.Module):
  def __init__(self, state_dims, action_dim):
    super(Critic, self).__init__()
    # first critic
    self.layer_1 = nn.Linear(state_dims + action_dim, 400)
    self.layer_2 = nn.Linear(400, 300)
    self.layer_3 = nn.Linear(300, 1) # a Q-value is a single scalar per (state, action) pair

    # second critic
    self.layer_4 = nn.Linear(state_dims + action_dim, 400)
    self.layer_5 = nn.Linear(400, 300)
    self.layer_6 = nn.Linear(300, 1)

  def forward(self, x, u): # x - state, u - action
    xu = torch.cat([x, u], 1) # concatenate along dim 1 (features); the batch dimension is kept
    # forward prop of first critic
    x1 = F.relu(self.layer_1(xu))
    x1 = F.relu(self.layer_2(x1))
    x1 = self.layer_3(x1)
    # forward prop of second critic
    x2 = F.relu(self.layer_4(xu))
    x2 = F.relu(self.layer_5(x2))
    x2 = self.layer_6(x2)

    return x1, x2

  # Q-value of the first critic only; this is used for the actor (policy) update
  def Q1(self, x, u): # x - state, u - action
    xu = torch.cat([x, u], 1)
    x1 = F.relu(self.layer_1(xu))
    x1 = F.relu(self.layer_2(x1))
    x1 = self.layer_3(x1)

    return x1
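
The training step uses the two Q-value estimates by taking their minimum when building the target. A minimal scalar sketch of that target computation follows; the function name `td_target` and the numeric values are illustrative assumptions.

```python
def td_target(reward, done, q1_next, q2_next, discount=0.99):
    """Target value: reward + (1 - done) * discount * min(Q1', Q2')."""
    min_q = min(q1_next, q2_next)   # the more pessimistic of the two estimates
    return reward + (1.0 - done) * discount * min_q

# non-terminal transition: bootstrap from the smaller target estimate
print(td_target(reward=1.0, done=0.0, q1_next=10.0, q2_next=8.0))
# terminal transition: the target is just the reward
print(td_target(reward=1.0, done=1.0, q1_next=10.0, q2_next=8.0))
```

Taking the minimum keeps a single over-optimistic critic from inflating the target.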




### Steps 4 through 15: Build T3D Model and Training Procedure

* Two sets of Actor-Critic Model
  * Model- Model
  * Model - Target

---
* Step 4: Sample a batch of transitions from the replay memory.
* Step 5: Find next_action (**a'**) from next_state (**s'**) using the actor target.
* Step 6: Add Gaussian noise to next_action and clamp it to the valid range of actions.
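
Steps 5 and 6 can be sketched for a single scalar action as follows. The helper names `clamp` and `smoothed_action` are hypothetical; the defaults `policy_noise=0.2` and `noise_clip=0.5` mirror the defaults of the training code, while `max_action=1.0` and the starting action `0.9` are assumptions for illustration.

```python
import random

def clamp(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(high, value))

def smoothed_action(next_action, max_action, policy_noise=0.2, noise_clip=0.5):
    """Add clipped Gaussian noise to an action, then clamp to the action range."""
    noise = random.gauss(0.0, policy_noise)
    noise = clamp(noise, -noise_clip, noise_clip)
    return clamp(next_action + noise, -max_action, max_action)

random.seed(0)
samples = [smoothed_action(0.9, max_action=1.0) for _ in range(1000)]
print(min(samples), max(samples))
```

The noise is clipped first, so the perturbed action stays close to the original one, and the second clamp keeps it inside the valid range.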


---


In [0]:
# select the device - cpu or gpu
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [0]:
#Build T3D model

# input: state & action
class T3D(object):
  def __init__(self, state_dims, action_dim, max_action):
    # build T3D class
    self.actor = Actor(state_dims, action_dim, max_action).to(device) # trained by gradient descent on the actor loss
    self.actor_target = Actor(state_dims, action_dim, max_action).to(device) # updated by Polyak averaging for stability
    # initialize the target with the same weights as the model
    self.actor_target.load_state_dict(self.actor.state_dict())

    self.actor_optimizer = torch.optim.Adam(self.actor.parameters())
    #
    # Critic
    self.critic = Critic(state_dims, action_dim).to(device)
    self.critic_target = Critic(state_dims, action_dim).to(device)
    self.critic_target.load_state_dict(self.critic.state_dict())

    self.critic_optimizer = torch.optim.Adam(self.critic.parameters())

    self.max_action = max_action

  # input : state
  # output : action
  def select_action(self, state):
    state = torch.Tensor(state.reshape(1, -1)).to(device)
    # move the result to the CPU and return it as a flat numpy array
    return self.actor(state).cpu().data.numpy().flatten()

  # for each training iteration
  #    sample a batch of transitions (step 4) and run steps 5 through 15
  #
  def train(self, replay_buffer, iterations, batch_size=100, discount=0.99, \
            tau = 0.005, policy_noise=0.2, noise_clip=0.5, policy_freq = 2):
    for it in range(iterations):
      # step 4: sample a batch of transitions (s, s', a, r, done) from the replay memory
      batch_states, batch_next_states, batch_actions, batch_rewards, batch_dones \
          = replay_buffer.sample(batch_size)
      # convert to torch tensors
      state      = torch.Tensor(batch_states).to(device)
      next_state = torch.Tensor(batch_next_states).to(device)
      action     = torch.Tensor(batch_actions).to(device)
      reward     = torch.Tensor(batch_rewards).to(device)
      done       = torch.Tensor(batch_dones).to(device)

      # step5: find next_action using next_state. Use the actor_target model.
      # Calling the module directly is the PyTorch idiom (it runs forward());
      # nn.Module has no separate predict().
      next_action = self.actor_target(next_state)

      # step6: Add Gaussian noise to next_action and clamp it to the valid range of actions
      noise = torch.randn_like(action) * policy_noise
      noise = noise.clamp(-noise_clip, noise_clip)
      next_action = (next_action + noise).clamp(-self.max_action, self.max_action)

      # step7: Find Q values from the two target critics.
      target_Q1, target_Q2 = self.critic_target.forward(next_state, next_action)

      # step8: Use the min of the two Q values. This counters Q-value overestimation and improves stability.
      target_Q = torch.min(target_Q1, target_Q2)

      # step9: Find target_Q value: reward + (1 - done) * gamma * min(Qt1, Qt2)
      # detach() keeps gradients from flowing into the target networks
      target_Q = reward + ((1 - done) * discount * target_Q).detach()

      # step10: Find q-value from the critic models.
      current_Q1, current_Q2 = self.critic.forward(state, action)

      # step11: Compute critic loss.
      critic_loss = F.mse_loss(current_Q1, target_Q) + F.mse_loss(current_Q2, target_Q)

      # step12: Backpropagate the critic loss and update the params of two Critic models.
      self.critic_optimizer.zero_grad() # init the gradients to zero
      critic_loss.backward() # compute gradients
      self.critic_optimizer.step() # perform weight updates

      # Delayed Policy Updates
      # every policy_freq (two) iterations, update the actor and the target models
      if it % policy_freq == 0:
        # this is DPG part
        # step13: every two iterations, update the actor model by
        #         performing gradient ASCENT on the output of the first critic model
        #
        # minimizing the negative Q-value is gradient ASCENT (maximization) on the Q-value
        actor_loss = -(self.critic.Q1(state, self.actor(state)).mean())
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # step14: Update the actor_target model by Polyak averaging
        # (tau is an argument of train(), not an attribute of the class)
        for param, target_param in zip(self.actor.parameters(), self.actor_target.parameters()):
          target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

        # step15: Update the critic_target model by Polyak averaging
        for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
          target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
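
The target update in steps 14 and 15 is Polyak averaging: each target weight moves a small fraction `tau` toward the corresponding model weight. A minimal sketch on plain Python lists, where `soft_update` and the weight values are illustrative stand-ins for the network parameters:

```python
def soft_update(params, target_params, tau=0.005):
    """Return new target parameters: tau * param + (1 - tau) * target_param."""
    return [tau * p + (1.0 - tau) * tp for p, tp in zip(params, target_params)]

params = [1.0, -2.0]   # stand-in for the model weights
target = [0.0, 0.0]    # stand-in for the target weights

for _ in range(1000):
    target = soft_update(params, target)

print(target)  # drifts toward params over many updates, never in one jump
```

With `tau = 0.005`, a single update moves the target by only 0.5% of the gap, which is why the targets used in the critic loss change slowly.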
 