# The School Of AI Assignment


## **Phase 2 Session 9: Assignment**

1. Well, there is a reason why this code is in the image, and not pasted.
2. You need to:
    1. Write this code down in a Colab file and upload it to GitHub.
    2. Write a README file explaining all 15 steps we have taken:
        1. The README must explain each part of the code.
        2. Each part of the code must be accompanied by a drawing/image (you cannot use the images from the course content).
    3. Upload the link.

## **Twin Delayed DDPG (TD3)**
* **DDPG** stands for **Deep Deterministic Policy Gradient**, a reinforcement learning algorithm designed for environments with continuous action spaces. **TD3** builds on it with twin critics, delayed policy updates, and noise added to the target action, as the 15 steps in this README show.
* To apply Q-learning to continuous tasks, DDPG uses an Actor-Critic model.
* Actor-Critic has 2 neural networks that work in the following way:
    1. The Actor is the policy: it takes the State as input and outputs Actions.
    2. The Critic takes States and Actions concatenated together as input and outputs a Q-value.
* The Critic learns the optimal Q-values, which are then used for gradient ascent to update the parameters of the Actor.
* By learning the Q-values (estimates of expected cumulative reward) and the parameters of the policy at the same time, we can maximize expected reward.
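
The gradient-ascent idea in the bullets above can be sketched in one dimension. This is a toy illustration only: it assumes a fixed, hand-written `critic` function instead of a learned network, and the names `critic`, `critic_grad_action`, `actor` and `weight` are hypothetical, not part of the assignment code.

```python
# Toy deterministic policy gradient in one dimension (illustrative only).
# Assumption: the critic is a fixed, hand-written function rather than a
# learned network, so its gradient can be written out by hand.

def critic(state, action):
    # Hypothetical Q-value: highest when action == 2 * state.
    return -(action - 2.0 * state) ** 2

def critic_grad_action(state, action):
    # dQ/da for the critic above.
    return -2.0 * (action - 2.0 * state)

def actor(weight, state):
    # Deterministic policy: a = w * s.
    return weight * state

weight = 0.0
learning_rate = 0.05
states = [0.5, 1.0, 1.5]

for _ in range(200):
    grad = 0.0
    for s in states:
        a = actor(weight, s)
        # Chain rule: dQ/dw = dQ/da * da/dw, with da/dw = s.
        grad += critic_grad_action(s, a) * s
    # Gradient ASCENT: move the weight in the direction that raises Q.
    weight += learning_rate * grad / len(states)

print(round(weight, 3))
```

The actor's weight moves toward 2.0, the value at which the critic's score is highest for every state.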


## **TD3 Steps With Screenshots**

<img width="1083" alt="Screenshot 2020-03-31 at 1 35 11 PM" src="https://user-images.githubusercontent.com/15984084/78004576-ec2de080-7357-11ea-941c-6a9987ba3073.png">

## INITIALIZATION
import os
import time
import random
import numpy as np
import matplotlib.pyplot as plt
import pybullet_envs # registers the PyBullet environments with gym
import gym
import torch
import torch.nn as nn
import torch.nn.functional as F
from gym import wrappers
from torch.autograd import Variable
from collections import deque


### **Step1**: Define Experience Replay Memory
<img width="922" alt="Screenshot 2020-03-31 at 1 17 51 PM" src="https://user-images.githubusercontent.com/15984084/78001064-cd791b00-7352-11ea-960f-10ba8eb25a0b.png">


##  STEP 1
'''
Define a class for Experience Replay Memory which has 3 functions:
1. __init__(): Initialize an empty memory with a maximum size of 1e6 transitions.
2. add(): Append a transition to the replay memory, overwriting the oldest entries once it is full.
3. sample(): Get a random batch of transitions from the replay memory.
'''
class ReplayBuffer(object):
    def __init__(self, max_size=1e6):
        self.storage = []
        self.max_size = int(max_size) # 1e6 is a float; store the capacity as an int so ptr stays an integer index
        self.ptr = 0

    def add(self, transition):
        if len(self.storage) == self.max_size:
            self.storage[int(self.ptr)] = transition
            self.ptr = (self.ptr + 1) % self.max_size
        else:
            self.storage.append(transition)

    def sample(self, batch_size):
        ind = np.random.randint(0, len(self.storage), batch_size)
        batch_states, batch_next_states, batch_actions, batch_rewards, \
            batch_dones = [], [], [], [], []
        for i in ind:
            state, next_state, action, reward, done = self.storage[i]
            # np.asarray avoids a copy when possible; np.array(..., copy=False) raises on NumPy 2 when a copy is needed
            batch_states.append(np.asarray(state))
            batch_next_states.append(np.asarray(next_state))
            batch_actions.append(np.asarray(action))
            batch_rewards.append(np.asarray(reward))
            batch_dones.append(np.asarray(done))
        return np.array(batch_states), np.array(batch_next_states), \
            np.array(batch_actions), np.array(batch_rewards).reshape(-1, 1), \
                np.array(batch_dones).reshape(-1, 1)
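
As a point of comparison, the same add/sample behaviour can be sketched with `collections.deque`, which is imported at the top but not used by `ReplayBuffer`. This is a hypothetical variant for illustration, not the class used in the assignment; `DequeReplayBuffer` and the transition values are made up.

```python
import random
from collections import deque


class DequeReplayBuffer:
    """Hypothetical deque-based variant; not the ReplayBuffer class above."""

    def __init__(self, max_size=1_000_000):
        # deque(maxlen=...) drops the oldest item automatically when full.
        self.storage = deque(maxlen=int(max_size))

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        # Sample indices with replacement, as np.random.randint does.
        picks = [self.storage[random.randrange(len(self.storage))]
                 for _ in range(batch_size)]
        # Regroup: list of transitions -> one tuple per field.
        states, next_states, actions, rewards, dones = zip(*picks)
        return states, next_states, actions, rewards, dones


buffer = DequeReplayBuffer(max_size=3)
for step in range(5):
    # (state, next_state, action, reward, done) with made-up values
    buffer.add(([step], [step + 1], [0.0], 1.0, False))

states, next_states, actions, rewards, dones = buffer.sample(batch_size=4)
print(len(buffer.storage), len(states))
```

With a capacity of 3 and 5 additions, only the 3 most recent transitions remain, and a batch of 4 is drawn from them with replacement.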

### **Step2**: Define Actor Model
<img width="717" alt="Screenshot 2020-03-31 at 1 17 59 PM" src="https://user-images.githubusercontent.com/15984084/78001083-d2d66580-7352-11ea-9fc1-524c72bb431a.png">

##  STEP 2
'''
Build two DNNs with the same definition: one for the Actor model and one for the Actor Target
__init__() :    state_dims - how many variables there are in a state
                action_dim - how many actions can be taken (they are floating point numbers, but the number of dimensions is fixed)
                max_action - limit for each action
                Initialize Actor

forward(): forward propagation to return which actions to take and by how much

'''
class Actor (nn.Module):
    def __init__(self, state_dims, action_dim, max_action):
        # max_action scales the tanh output so actions stay within the range the environment supports
        super(Actor, self).__init__() # activate the inheritance
        self.layer_1 = nn.Linear(state_dims, 400)
        self.layer_2 = nn.Linear(400, 300)
        self.layer_3 = nn.Linear(300, action_dim)
        self.max_action = max_action

    def forward(self, x):
        x = F.relu(self.layer_1(x))
        x = F.relu(self.layer_2(x))
        x = self.max_action * torch.tanh(self.layer_3(x))
        return x
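
The effect of the last line of `forward()` can be checked with plain numbers. The sketch below assumes `max_action = 2.0` and a few made-up pre-activation values; it only illustrates that `max_action * tanh(z)` always lands inside `[-max_action, max_action]`.

```python
import math

max_action = 2.0  # assumed action limit, for illustration only

# Made-up pre-activation values such as the last linear layer might produce.
pre_activations = [-100.0, -1.0, 0.0, 1.0, 100.0]

# tanh squashes to [-1, 1]; multiplying rescales to the action range.
actions = [max_action * math.tanh(z) for z in pre_activations]

for z, a in zip(pre_activations, actions):
    print(f"z = {z:7.1f} -> action = {a:+.4f}")
```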

### **Step3**: Define Critic Model
<img width="817" alt="Screenshot 2020-03-31 at 1 19 10 PM" src="https://user-images.githubusercontent.com/15984084/78001100-d964dd00-7352-11ea-8bd4-43c918b77d83.png">


## Step 3
'''
Build two DNNs for the two Critic models; the same class is also used for the two Critic Targets
The Critic model and the Critic Target keep separate weights (the target is updated by Polyak averaging).
__init__() :    state_dims - how many variables there are in a state
                action_dim - how many actions can be taken (they are floating point numbers, but the number of dimensions is fixed)
                Initialize two Critics

forward(): forward propagation to return both Q-values, one per critic

Q1(): forward propagation through the first critic only. It is used when
    training the Actor, where only the first critic's Q-value is needed,
    so it is kept as a separate method.
'''
class Critic(nn.Module):
    def __init__(self, state_dims, action_dim):
        super(Critic, self).__init__() # activate the inheritance

        # First Critic Network
        self.layer_1 = nn.Linear(state_dims + action_dim, 400)
        self.layer_2 = nn.Linear(400, 300)
        self.layer_3 = nn.Linear(300, 1) # a critic outputs a single Q-value per (state, action) pair

        # Second Critic Network
        self.layer_4 = nn.Linear(state_dims + action_dim, 400)
        self.layer_5 = nn.Linear(400, 300)
        self.layer_6 = nn.Linear(300, 1) # a critic outputs a single Q-value per (state, action) pair

    def forward(self, x, u): # x - state, u - action
        xu = torch.cat([x, u], 1) # dim=1 joins state and action along the feature axis; dim=0 would stack along the batch axis

        # forward propagation on first critic
        x1 = F.relu(self.layer_1(xu))
        x1 = F.relu(self.layer_2(x1))
        x1 = self.layer_3(x1)

        # forward propagation on second critic
        x2 = F.relu(self.layer_4(xu))
        x2 = F.relu(self.layer_5(x2))
        x2 = self.layer_6(x2)

        return x1, x2

    def Q1(self, x, u): # x - state, u = action; this is used for updating Q values
        xu = torch.cat([x, u], 1) # dim=1 joins state and action along the feature axis; dim=0 would stack along the batch axis
        x1 = F.relu(self.layer_1(xu))
        x1 = F.relu(self.layer_2(x1))
        x1 = self.layer_3(x1)
        return x1
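
The shape of the critic input may be easier to see with plain lists. The sketch below is a plain-Python analogue of `torch.cat([x, u], 1)`, not PyTorch code; the sizes `state_dims = 3` and `action_dim = 2` and all values are made up.

```python
# Batch of 2 transitions; state_dims = 3 and action_dim = 2 are made-up sizes.
states = [[0.1, 0.2, 0.3],
          [0.4, 0.5, 0.6]]
actions = [[1.0, -1.0],
           [0.5, 0.0]]

# Join each row's state and action so that every critic input has
# state_dims + action_dim features. The batch size does not change.
critic_inputs = [s + a for s, a in zip(states, actions)]

print(len(critic_inputs), len(critic_inputs[0]))
```

Each row keeps its place in the batch and grows from 3 to 5 features, which is why the first linear layer of each critic takes `state_dims + action_dim` inputs.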

### **Step4**: Get Random sample from Experience Replay Memory
<img width="1054" alt="Screenshot 2020-03-31 at 1 19 30 PM" src="https://user-images.githubusercontent.com/15984084/78001122-e255ae80-7352-11ea-98cd-b151529fafd7.png">

<img width="973" alt="Screenshot 2020-03-31 at 1 19 40 PM" src="https://user-images.githubusercontent.com/15984084/78001146-eb468000-7352-11ea-81ee-c1542a3dbd43.png">


## Step 4-15
'''
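Training Process - Create the T3D class (the class name used here for the TD3 algorithm)
__init__():     state_dims - how many variables there are in a state
                action_dim - how many actions can be taken (they are floating point numbers, but the number of dimensions is fixed)
                max_action - limit for each action
                Initialize T3D Model

select_action(): run the Actor on a single state and return the action as a flat NumPy array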

'''
# Selecting the device (CPU or GPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Building the whole Training Process into a class
class T3D(object):
    def __init__(self, state_dims, action_dim, max_action):
        # making sure our T3D class can work with any env
        self.actor = Actor(state_dims, action_dim, max_action).to(device) # Gradient Descent
        self.actor_target = Actor(state_dims, action_dim, max_action).to(device) # Polyak Averaging
        self.actor_target.load_state_dict(self.actor.state_dict())
        # initializing the target with the model weights to keep them the same
        self.actor_optimizer = torch.optim.Adam(self.actor.parameters())

        self.critic = Critic(state_dims, action_dim).to(device) # Gradient Descent
        self.critic_target = Critic(state_dims, action_dim).to(device) # Polyak Averaging
        self.critic_target.load_state_dict(self.critic.state_dict())
        # initializing the target with the model weights to keep them the same
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters())
        self.max_action = max_action

    def select_action(self, state):
        state = torch.Tensor(state.reshape(1, -1,)).to(device)
        return self.actor(state).cpu().data.numpy().flatten()

    def train(self, replay_buffer, iterations, batch_size=100, discount=0.99,
    tau=0.005, policy_noise=0.2, noise_clip=0.5, policy_freq=2):
        for it in range(iterations):
            ## Step 4
            '''We sample a batch of transitions (s, s', a, r, done) from memory'''
            batch_states, batch_next_states, batch_actions, batch_rewards, batch_dones \
                = replay_buffer.sample(batch_size)
            state = torch.Tensor(batch_states).to(device)
            next_state = torch.Tensor(batch_next_states).to(device)
            action = torch.Tensor(batch_actions).to(device)
            reward = torch.Tensor(batch_rewards).to(device)
            done = torch.Tensor(batch_dones).to(device)

            ## Step 5
            '''From the next state s', the actor target plays the next action a'''
            next_action = self.actor_target.forward(next_state)

            ## Step 6
            '''We add Gaussian noise to this next action a' and
            we clamp it in a range of values supported by the environment'''
            noise = torch.Tensor(batch_actions).data.normal_(0, policy_noise).to(device)
            noise = noise.clamp(-noise_clip, noise_clip)
            next_action = (next_action + noise).clamp(-self.max_action, self.max_action)

            ## Step 7
            '''The two Critic targets each take the pair (s', a') as input
            and return two Q values, Qt1(s', a') and Qt2(s', a') as outputs'''
            target_Q1, target_Q2 = self.critic_target.forward(next_state, next_action)

            ## Step 8
            ''' Keep the minimum of these two Q-Values'''
            target_Q = torch.min(target_Q1, target_Q2)

            ## Step 9
            '''
            We get the final target of the two Critic models, which is:
                Qt = r + gamma * min(Qt1, Qt2)
                target_Q = reward + discount * target_Q

            We can't use the above equation as is. We need to make
            two minor modifications:
            First, the discounted future value should only be added while the
                episode is not over, which means we need to integrate Done
            Second, target_Q would otherwise stay attached to the computation
                graph of the target networks, and backpropagating through it
                would complicate things, i.e. we need to use detach.
            *NOTE* => Done=1 (episode over), Done=0 (episode not over)
            '''
            target_Q = reward + ((1-done) * discount * target_Q).detach()

            ## Step 10
            ''' The two Critic models take (s, a) and return two Q-values'''
            current_Q1, current_Q2 = self.critic.forward(state, action)

            ## Step 11
            '''Compute the loss coming from two critic models'''
            critic_loss = F.mse_loss(current_Q1, target_Q) + F.mse_loss(current_Q2, target_Q)

            ## Step 12
            ''' Backpropagate the critic loss and update the parameters of the two Critic models with an Adam optimizer'''
            self.critic_optimizer.zero_grad() # initializing the gradients to zero
            critic_loss.backward() # computing the gradients
            self.critic_optimizer.step() # performing the weight updates

            ## Step 13
            '''Once every two iterations, we update our Actor model by  performing
            gradient ASCENT on the output of the first Critic model'''
            if it % policy_freq == 0:
                # This is DPG part
                actor_loss = -(self.critic.Q1(state, self.actor(state)).mean())
                self.actor_optimizer.zero_grad()
                actor_loss.backward()
                self.actor_optimizer.step()

                ## Step 14
                '''Still once every two iterations, we update our Actor
                Target by Polyak Averaging'''
                for param, target_param in zip(self.actor.parameters(), self.actor_target.parameters()):
                    target_param.data.copy_(tau * param.data + (1-tau)* target_param.data)

                ## Step 15
                '''Still once every two iterations, we update our Critic
                Target by Polyak Averaging'''
                for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
                    target_param.data.copy_(tau * param.data + (1-tau)* target_param.data)

### **Step5**: Get Next Action (a’) from Next State(s’) using actor target
<img width="895" alt="Screenshot 2020-03-31 at 1 19 48 PM" src="https://user-images.githubusercontent.com/15984084/78001160-ef729d80-7352-11ea-8e07-7a3e892a5abe.png">

### **Step6**: Add Gaussian Noise to Next Action (a’) and clamp to a range
<img width="1039" alt="Screenshot 2020-03-31 at 1 20 04 PM" src="https://user-images.githubusercontent.com/15984084/78001188-f7cad880-7352-11ea-8fc5-413ae635421a.png">
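
The two clamps in this step can be traced with plain numbers. The sketch below uses the default `policy_noise` and `noise_clip` values from `train()`; `max_action = 1.0`, the `clamp` helper and the sample actions are assumptions made for illustration.

```python
import random


def clamp(value, low, high):
    # Hypothetical helper: limit value to the interval [low, high].
    return max(low, min(high, value))


policy_noise = 0.2   # standard deviation of the Gaussian noise
noise_clip = 0.5     # noise is limited to [-noise_clip, noise_clip]
max_action = 1.0     # assumed action limit, for illustration only

random.seed(0)
next_action = [0.9, -0.3]  # made-up output of the actor target

smoothed = []
for a in next_action:
    # First clamp: keep the noise itself small.
    noise = clamp(random.gauss(0.0, policy_noise), -noise_clip, noise_clip)
    # Second clamp: keep the noisy action inside the valid action range.
    smoothed.append(clamp(a + noise, -max_action, max_action))

print(smoothed)
```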

### **Step7**: The two Critic Targets take s’ and a’ as input and return two Q Values
<img width="1012" alt="Screenshot 2020-03-31 at 1 20 14 PM" src="https://user-images.githubusercontent.com/15984084/78001226-03b69a80-7353-11ea-9c54-540a04c65a87.png">


### **Step8**: Take Minimum of Q1 and Q2 and output Target Q
<img width="784" alt="Screenshot 2020-03-31 at 1 20 22 PM" src="https://user-images.githubusercontent.com/15984084/78001254-0fa25c80-7353-11ea-9e40-06181bf62999.png">

### **Step9**: Get the final target of the two Critic models
<img width="979" alt="Screenshot 2020-03-31 at 1 20 28 PM" src="https://user-images.githubusercontent.com/15984084/78001284-1d57e200-7353-11ea-85a1-7659369569cf.png">
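
Steps 8 and 9 together can be worked through by hand. The sketch below uses the default `discount = 0.99` from `train()` and a made-up batch of three transitions; the third one ends its episode, so its target is just the reward.

```python
discount = 0.99

# Made-up batch of 3 transitions.
rewards = [1.0, 0.5, -1.0]
dones = [0.0, 0.0, 1.0]          # the last transition ends its episode
target_q1 = [10.0, 4.0, 7.0]     # first critic target, Qt1(s', a')
target_q2 = [9.0, 5.0, 8.0]      # second critic target, Qt2(s', a')

targets = []
for r, d, q1, q2 in zip(rewards, dones, target_q1, target_q2):
    min_q = min(q1, q2)                                # Step 8
    targets.append(r + (1.0 - d) * discount * min_q)   # Step 9

print(targets)
```

The targets come out at roughly 9.91, 4.46 and -1.0: the first two add the discounted minimum Q-value to the reward, while the `(1 - done)` factor removes that term for the finished episode.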

### **Step10**: The two Critic Models take current state (s) and current action (a) as input and return two current Q Values
<img width="1007" alt="Screenshot 2020-03-31 at 1 20 35 PM" src="https://user-images.githubusercontent.com/15984084/78001301-22b52c80-7353-11ea-8db9-0b916c151367.png">

### **Step11**: Compute the Critic Loss
<img width="831" alt="Screenshot 2020-03-31 at 1 20 40 PM" src="https://user-images.githubusercontent.com/15984084/78001415-44aeaf00-7353-11ea-9dab-99b4f0056497.png">
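
The critic loss is the sum of two mean squared errors against the same target. A plain-Python sketch with made-up values follows; the `mse` helper is hypothetical and stands in for `F.mse_loss` with its default mean reduction.

```python
def mse(predictions, targets):
    # Mean squared error over the batch.
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Made-up values for a batch of 3 transitions.
target_q = [9.91, 4.46, -1.0]
current_q1 = [9.0, 5.0, 0.0]     # first critic, Q1(s, a)
current_q2 = [10.0, 4.0, -1.0]   # second critic, Q2(s, a)

# Both critics are pulled toward the same target.
critic_loss = mse(current_q1, target_q) + mse(current_q2, target_q)
print(round(critic_loss, 4))
```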

### **Step12**: Backpropagate the critic loss and update the parameters of two Critic models
<img width="1133" alt="Screenshot 2020-03-31 at 1 20 49 PM" src="https://user-images.githubusercontent.com/15984084/78001456-51cb9e00-7353-11ea-9288-3223cba43a1f.png">

### **Step13**: Once every two iterations, we update our Actor model
<img width="1242" alt="Screenshot 2020-03-31 at 1 21 19 PM" src="https://user-images.githubusercontent.com/15984084/78001508-6740c800-7353-11ea-844b-1be0fb7d75d4.png">
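
The "delayed" part of TD3 comes from the `it % policy_freq == 0` check. A small sketch, assuming the default `policy_freq = 2` and six training iterations, lists which iterations update what:

```python
policy_freq = 2
iterations = 6

# The critics are updated on every iteration.
critic_update_steps = list(range(iterations))

# The actor (and both targets) are updated only when it % policy_freq == 0.
actor_update_steps = [it for it in range(iterations) if it % policy_freq == 0]

print(critic_update_steps)
print(actor_update_steps)
```

The critics get two updates for every actor update, so the actor is always trained against a slightly more settled Q-value estimate.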

### **Step14**: Still once every two iterations, we update our Actor Target by Polyak Averaging
<img width="1247" alt="Screenshot 2020-03-31 at 1 21 28 PM" src="https://user-images.githubusercontent.com/15984084/78001623-97886680-7353-11ea-977e-8cb9a77f1eda.png">

### **Step15**: Still once every two iterations, we update our Critic Target by Polyak Averaging
<img width="1226" alt="Screenshot 2020-03-31 at 1 21 43 PM" src="https://user-images.githubusercontent.com/15984084/78001797-d9b1a800-7353-11ea-91a0-e7d6e9169dbb.png">
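
Polyak averaging in Steps 14 and 15 moves each target parameter a small fraction `tau` of the way toward the model parameter. The sketch below uses the default `tau = 0.005` from `train()` on plain floats; `polyak_update` and the parameter values are hypothetical, and the model parameters are held fixed to show the drift clearly.

```python
tau = 0.005


def polyak_update(params, target_params, tau):
    # target <- tau * model + (1 - tau) * target, element by element.
    return [tau * p + (1.0 - tau) * tp for p, tp in zip(params, target_params)]


model_params = [1.0, -2.0]    # made-up model weights, held fixed here
target_params = [0.0, 0.0]    # made-up target weights

first_step = polyak_update(model_params, target_params, tau)

for _ in range(1000):
    target_params = polyak_update(model_params, target_params, tau)

print(first_step)
print(target_params)
```

A single update barely moves the target, while many updates bring it close to the model weights. This slow tracking is what keeps the targets stable during training.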