<a href="https://colab.research.google.com/github/abyaadrafid/Deep-Reinforcement-Learning/blob/master/Policy%20Gradients/Synchronous_Advantage_Actor_Critic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advantage Actor Critic
Advantage actor critic variants are perhaps the best-known algorithms in the actor critic family. Actor critic methods try to combine the strengths of value-based and policy-based methods. As we saw with the Q-learning variants, the value function and the policy are connected. Actor critic methods modify the policy gradient algorithm to make use of a learned value function.

In the policy gradient notebook, we used the discounted returns of actions together with the log probability of the policy to learn our target policy. As it happens, subtracting a baseline from the discounted returns yields better results: it reduces the variance of the policy updates, which lowers the chance of the policy degrading arbitrarily. The value function serves as a good baseline.

If we take V(s) as our state value function and G as our discounted return, then instead of using G directly in the policy gradient, we can use:
```
G - V(s)
A(s,a) = Q(s,a) - V(s)
```
In this context, G is a sampled estimate of the Q value. Replacing G with Q(s,a) gives the advantage function A(s,a), which describes how good or bad an action is compared with what the value function expects from that state. It measures the extra return the agent could obtain by taking that particular action. Because we use the advantage function, the method is called advantage actor critic. There are also variants that use the state value or the Q value directly.
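As a rough sketch of the quantity above, the snippet below computes the discounted return G for every step of one episode and subtracts a baseline V(s). The helper names `discounted_returns` and `advantages`, the discount factor of 0.5, and the hand-written rewards and value estimates are assumptions chosen to keep the arithmetic easy to follow. They are not taken from the notebook.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(returns, values):
    """Subtract the baseline V(s) from the sampled return G at each step."""
    return [g - v for g, v in zip(returns, values)]


rewards = [1.0, 0.0, 2.0]   # illustrative rewards for a three-step episode
values = [1.0, 1.5, 1.0]    # illustrative critic estimates V(s) per step

returns = discounted_returns(rewards, gamma=0.5)
adv = advantages(returns, values)

print(returns)  # [1.5, 1.0, 2.0]
print(adv)      # [0.5, -0.5, 1.0]
```

A positive entry means the sampled return was higher than the baseline expected, so the action taken at that step should become more likely. A negative entry pushes the other way.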

## Models
There are two models in actor critic methods, which optionally share parameters:

1. The critic : Updates value function parameters. The value function can be an action value or a state value function, depending on the implementation. In this case we are going to use the state value function.
2. The actor : Updates policy parameters as suggested by the critic.
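The two roles above can be sketched as a pair of loss terms. This is a minimal illustration under stated assumptions, not necessarily the loss used later in this notebook: the helper name `a2c_losses` is hypothetical, the critic is trained with a mean squared error against the sampled returns, and the advantage is detached so that the actor loss does not push gradients into the critic.

```python
import torch


def a2c_losses(log_probs, values, returns):
    """Hypothetical helper: actor and critic losses for a batch of steps."""
    advantage = returns - values
    # Actor: raise the log probability of actions with positive advantage.
    # detach() treats the advantage as a constant for the policy update.
    actor_loss = -(log_probs * advantage.detach()).mean()
    # Critic: regress V(s) towards the sampled return G.
    critic_loss = advantage.pow(2).mean()
    return actor_loss, critic_loss


# Illustrative numbers for two steps
log_probs = torch.log(torch.tensor([0.5, 0.25]))
values = torch.tensor([1.0, 2.0], requires_grad=True)
returns = torch.tensor([2.0, 1.0])

actor_loss, critic_loss = a2c_losses(log_probs, values, returns)
critic_loss.backward()  # gradients reach the value estimates only
```

When the actor and critic share parameters, the two terms are commonly summed into one loss, often with a weighting coefficient on the critic term.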

## A2C over A3C
Asynchronous Advantage Actor Critic (A3C) is one of the most influential actor critic methods. A3C distributes training across multiple CPU workers. However, researchers found that the synchronous version of the algorithm is better suited to GPU training.

The noise introduced by asynchrony was initially thought to help with regularization. However, the synchronous variant is more effective on GPUs, which is why we will implement A2C rather than A3C.
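To illustrate what "synchronous" means here, the sketch below steps several environment copies in lockstep and gathers all of their transitions into one batch, which would then feed a single update. `ToyEnv`, `collect_synchronous_batch`, and the fixed policy are hypothetical stand-ins invented for this example. They are not gym environments and not the training loop of this notebook.

```python
class ToyEnv:
    """Hypothetical stand-in environment whose state is a step counter."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.state >= self.episode_length
        return self.state, reward, done


def collect_synchronous_batch(envs, states, policy, n_steps):
    """Advance every environment n_steps times in lockstep; return one batch."""
    batch = []
    for _ in range(n_steps):
        # All workers choose their actions from the same, current policy
        actions = [policy(state) for state in states]
        for i, (env, action) in enumerate(zip(envs, actions)):
            next_state, reward, done = env.step(action)
            batch.append((i, states[i], action, reward, done))
            states[i] = env.reset() if done else next_state
    return batch


envs = [ToyEnv() for _ in range(4)]
states = [env.reset() for env in envs]
batch = collect_synchronous_batch(envs, states, policy=lambda s: s % 2, n_steps=3)

print(len(batch))  # 12 transitions: 4 environments x 3 steps
```

Because every worker waits for the others, all transitions in the batch come from the same policy parameters. In A3C, by contrast, each worker computes its update independently, so some updates are based on slightly stale parameters.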


In [1]:
!apt install swig cmake libopenmpi-dev zlib1g-dev
!pip install stable-baselines[mpi]==2.10.0 box2d box2d-kengz

Reading package lists... Done
Building dependency tree       
Reading state information... Done
zlib1g-dev is already the newest version (1:1.2.11.dfsg-0ubuntu2).
zlib1g-dev set to manually installed.
libopenmpi-dev is already the newest version (2.1.1-8).
cmake is already the newest version (3.10.2-1ubuntu2.18.04.1).
The following package was automatically installed and is no longer required:
  libnvidia-common-440
Use 'apt autoremove' to remove it.
The following additional packages will be installed:
  swig3.0
Suggested packages:
  swig-doc swig-examples swig3.0-examples swig3.0-doc
The following NEW packages will be installed:
  swig swig3.0
0 upgraded, 2 newly installed, 0 to remove and 39 not upgraded.
Need to get 1,100 kB of archives.
After this operation, 5,822 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 swig3.0 amd64 3.0.12-1 [1,094 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 swig amd64 3.0.12-1 [6,

In [2]:
import random
import sys
from time import time
from collections import deque, defaultdict, namedtuple
import numpy as np
import pandas as pd
import gym
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical

import matplotlib.pyplot as plt
from google.colab import drive 
%matplotlib inline

plt.style.use('seaborn')

In [3]:
env = gym.make('LunarLander-v2')
env.seed(0)
print(env.action_space)
print(env.observation_space)

Discrete(4)
Box(8,)


In [4]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

In [6]:
class ActorCritic(nn.Module):
  def __init__(self, state_size, action_size, fc1_size = 128, fc2_size = 256):
    super(ActorCritic, self).__init__()

    # Shared feature extractor used by both heads
    self.stem = nn.Sequential(
        nn.Linear(state_size, fc1_size),
        nn.ReLU(),
        nn.Linear(fc1_size, fc2_size),
        nn.ReLU()
    )

    # Actor head: one probability per action
    self.actor = nn.Sequential(
        self.stem,
        nn.Linear(fc2_size, action_size),
        nn.Softmax(dim=-1)
    )
    
    # Critic head: a single scalar state value
    self.critic = nn.Sequential(
        self.stem,
        nn.Linear(fc2_size, 1)
    )

  def forward(self, x):
    value = self.critic(x)
    probabilities = self.actor(x)
    dist = Categorical(probabilities)

    # Return the action distribution and the state value estimate
    return dist, value
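The `Categorical` distribution built in `forward` is what the agent samples actions from. The short sketch below shows the calls typically needed during training, using a hand-written probability tensor in place of a network output. The probabilities themselves are an illustrative assumption.

```python
import torch
from torch.distributions import Categorical

# One state and four actions, matching the Discrete(4) action space
probabilities = torch.tensor([[0.1, 0.2, 0.3, 0.4]])
dist = Categorical(probabilities)

action = dist.sample()            # tensor of shape (1,), values in 0..3
log_prob = dist.log_prob(action)  # log pi(a|s), used in the policy gradient
entropy = dist.entropy()          # sometimes added as an exploration bonus

print(action.shape, log_prob.shape, entropy.shape)
```

`action.item()` gives the plain integer that a discrete-action environment's `step` method expects.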