# REINFORCE
REINFORCE is a Monte Carlo policy gradient method. It updates the policy parameters using discounted returns estimated from sampled episodes.
We use a neural network to represent the agent's policy and train it toward the optimal one. The policy is a probability distribution over actions, conditioned on the current state.
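
To make "a probability distribution over actions" concrete, here is a minimal sketch in plain Python. The logits are made-up numbers standing in for a network's output, and `softmax` and `sample_action` are hypothetical helper names, not part of this notebook's code.

```python
import math
import random

def softmax(logits):
    """Turn unnormalized scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng):
    """Draw an action index according to probs; return it with its log-probability."""
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, math.log(probs[action])

# Hypothetical scores for four discrete actions in some state.
logits = [0.5, 1.5, -0.3, 0.1]
probs = softmax(logits)
action, log_prob = sample_action(probs, random.Random(0))
```

The log-probability of the sampled action is the quantity the policy gradient update later weights by the return.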

Value-based methods avoid learning the policy directly because it is often harder than value estimation. In some applications, however, approximating the value function is more complex than learning the policy itself, which makes value-based methods such as Q-learning less than ideal.

REINFORCE estimates the policy gradient from sampled trajectories and uses it to update the policy iteratively. Rewards are collected from real trajectories, and each update needs a complete episode, which is what makes it a Monte Carlo method. It is on-policy and keeps no memory of older trajectories, so we do not use a replay buffer. We only need to store the log-probabilities of the chosen actions and the rewards of the current episode until it finishes.
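
Because the return at each step depends on every later reward, it can only be computed once the episode has ended. The sketch below shows one way to do that in a single backward pass; `discounted_returns` is a hypothetical helper and the rewards are toy values.

```python
def discounted_returns(rewards, gamma):
    """Return G[t] = r[t] + gamma * r[t+1] + gamma^2 * r[t+2] + ... for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

episode_rewards = [1.0, 1.0, 1.0]  # toy episode with three steps
returns = discounted_returns(episode_rewards, gamma=0.5)
print(returns)  # [1.75, 1.5, 1.0]
```

This gives the same values as a nested loop over all later rewards, but in time linear in the episode length.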

We use the log-probabilities of the agent's actions, weighted by the discounted returns, to update the policy.
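
As a sketch of what this means, the quantities can be written as follows. The notation is an assumption made for illustration: $\pi_\theta$ is the policy network with parameters $\theta$, $r_k$ is the reward at step $k$, $\gamma$ is the discount factor, and $T$ is the episode length.

$$
G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t}\, r_k ,
\qquad
L(\theta) = -\sum_{t=0}^{T-1} G_t \,\log \pi_\theta(a_t \mid s_t)
$$

Minimizing $L(\theta)$ by gradient descent raises the probability of actions that were followed by high returns. The code in this notebook additionally standardizes $G_t$ within each episode before forming the loss.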

In [1]:
!apt install swig cmake libopenmpi-dev zlib1g-dev
!pip install stable-baselines[mpi]==2.10.0 box2d box2d-kengz

Reading package lists... Done
Building dependency tree       
Reading state information... Done
zlib1g-dev is already the newest version (1:1.2.11.dfsg-0ubuntu2).
zlib1g-dev set to manually installed.
libopenmpi-dev is already the newest version (2.1.1-8).
cmake is already the newest version (3.10.2-1ubuntu2.18.04.1).
The following package was automatically installed and is no longer required:
  libnvidia-common-440
Use 'apt autoremove' to remove it.
The following additional packages will be installed:
  swig3.0
Suggested packages:
  swig-doc swig-examples swig3.0-examples swig3.0-doc
The following NEW packages will be installed:
  swig swig3.0
0 upgraded, 2 newly installed, 0 to remove and 39 not upgraded.
Need to get 1,100 kB of archives.
After this operation, 5,822 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 swig3.0 amd64 3.0.12-1 [1,094 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 swig amd64 3.0.12-1 [6,

In [67]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import sys
import numpy as np
import pandas as pd
import gym
import matplotlib.pyplot as plt
from torch.distributions import Categorical
%matplotlib inline

plt.style.use('seaborn')

In [3]:
env = gym.make('LunarLander-v2')
env.seed(0)
print(env.action_space)
print(env.observation_space)

Discrete(4)
Box(8,)


In [4]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

In [5]:
class PolicyNetwork(nn.Module):
  """MLP that maps a state to a categorical distribution over actions."""
  def __init__(self, state_size, fc1_size, fc2_size, action_size):
    super(PolicyNetwork, self).__init__()
    
    self.layers = nn.Sequential(
      nn.Linear(state_size, fc1_size),
      nn.ReLU(),
      nn.Linear(fc1_size, fc2_size),
      nn.ReLU(),
      nn.Linear(fc2_size, action_size),
      # dim=-1 normalizes over actions for a single state (1-D) and for a batch (2-D)
      nn.Softmax(dim=-1)
    )

  def forward(self, state):
    probs = self.layers(state)
    dist = Categorical(probs)

    return dist

In [111]:
class PolicyGradientAgent():
  def __init__(self, lr, state_size, action_size, fc1_size=128, fc2_size=256):
    self.policy_network = PolicyNetwork(state_size, fc1_size, fc2_size, action_size).to(device)
    self.optimizer = optim.Adam(self.policy_network.parameters(), lr)
    # Per-episode storage; both lists are cleared after every update.
    self.action_memory = []  # log-probabilities of the actions taken
    self.reward_memory = []  # rewards received

  def choose_action(self, state):
    # The network returns a Categorical distribution, not raw probabilities.
    dist = self.policy_network(state)
    action = dist.sample()
    log_prob = dist.log_prob(action)
    self.action_memory.append(log_prob)

    return action.item()

  def store_rewards(self, reward):
    self.reward_memory.append(reward)

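  def learn(self):
    self.optimizer.zero_grad()
    # Discounted return G[t] = sum over k >= t of gamma^(k-t) * r[k].
    # gamma is read from the global set in the hyperparameter cell.
    G = np.zeros_like(self.reward_memory, dtype=np.float64)
    for t in range(len(self.reward_memory)):
      G_sum = 0
      discount = 1
      for k in range(t, len(self.reward_memory)):
        G_sum += self.reward_memory[k]*discount
        discount *= gamma
      G[t] = G_sum
    
    # Standardize the returns within the episode to reduce gradient variance.
    mean = np.mean(G)
    std = np.std(G) if np.std(G) > 0 else 1

    G = (G-mean)/std
    G = torch.tensor(G, dtype=torch.float).to(device)

    # Policy gradient loss: negative return-weighted log-probabilities.
    loss = 0
    for g, logprob in zip(G, self.action_memory):
      loss += -g*logprob

    loss.backward()
    self.optimizer.step()

    # REINFORCE is on-policy, so the episode data is discarded after the update.
    self.action_memory = []
    self.reward_memory = []
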
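
The standardization and loss steps inside `learn` can be followed by hand on a tiny example. This is a plain-Python sketch with made-up returns and log-probabilities; `standardize` is a hypothetical helper that mirrors the population standard deviation used by `np.std`.

```python
import math

def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    if std == 0:
        std = 1.0  # all returns equal: avoid dividing by zero
    return [(v - mean) / std for v in values]

returns = [1.75, 1.5, 1.0]      # made-up discounted returns
log_probs = [-1.2, -0.7, -1.6]  # made-up log-probabilities of the chosen actions

weights = standardize(returns)
# Same form as the loop in learn(): sum of -g * logprob over the episode.
loss = -sum(g * lp for g, lp in zip(weights, log_probs))
```

After standardization, steps with above-average returns get positive weights and the rest get negative weights, so the update pushes probability toward the better-than-average actions of the episode.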


In [123]:
state_size = env.observation_space.shape[0]
action_size = env.action_space.n

print('State size: {}, action size: {}'.format(state_size, action_size))

State size: 8, action size: 4


In [133]:
policy_agent = PolicyGradientAgent(1e-3, state_size=state_size, action_size=action_size)
n_episodes = 3000
gamma = 0.99
PRINT_EVERY = 50
ENV_SOLVED = 200

In [134]:
def train(agent):
  all_score = []      # running mean of episode scores, one entry per episode
  score_history = []  # raw score of every episode
  for i in range(n_episodes):
    done = False
    score = 0
    state = env.reset()
    # Roll out one full episode with the current policy.
    while not done:
      state = torch.Tensor(state).to(device)
      action = agent.choose_action(state)
      next_state, reward, done, info = env.step(action)
      agent.store_rewards(reward)
      state = next_state
      score += reward
      
    score_history.append(score)
    # Monte Carlo update: one gradient step per finished episode.
    agent.learn()
    all_score.append(np.mean(score_history))

    if i % PRINT_EVERY == 0:
      print('\r Progress {}/{}, average score:{:.2f}'.format(i, n_episodes, np.mean(score_history)), end="")
    # Stop as soon as a single episode reaches the target score.
    if score >= ENV_SOLVED:
      print('\rEnvironment solved in {} episodes, score: {:.2f}'.format(i, score), end="")
      sys.stdout.flush()
      break
  return all_score
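
The loop above stops on the first episode whose score reaches `ENV_SOLVED`, which a single lucky episode can trigger. A stricter and commonly used alternative is to require the average over a window of recent episodes to reach the threshold. The sketch below is an assumption about how that could look, not this notebook's criterion; `make_solved_checker`, `threshold` and `window` are hypothetical names.

```python
from collections import deque

def make_solved_checker(threshold=200.0, window=100):
    """Return a function that reports whether the last `window` scores average >= threshold."""
    recent = deque(maxlen=window)  # old scores drop out automatically

    def update(score):
        recent.append(score)
        # Only decide once a full window of episodes is available.
        return len(recent) == window and sum(recent) / window >= threshold

    return update

# Toy usage with a short window so the behavior is easy to follow.
is_solved = make_solved_checker(threshold=10.0, window=3)
flags = [is_solved(s) for s in [0, 0, 15, 15, 15]]
print(flags)  # [False, False, False, True, True]
```

Inside a training loop, the returned function would be called once per episode with that episode's score.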

In [135]:
def plot(score, string):
  plt.figure(figsize=(10,6))
  plt.plot(score)
  plt.plot(pd.Series(score).rolling(100).mean())
  plt.title('%s Training' % string)
  plt.xlabel('# of episodes')
  plt.ylabel('score')
  plt.show()

In [136]:
scores = train(policy_agent)

Environment solved in 590 episodes, score: 251.65