## REINFORCE Algorithm

- A <font color='#00ba47'>policy-gradient method</font>: it optimizes the policy directly instead of first learning a value function and deriving the policy from it.
- It learns a <font color='#00ba47'>stochastic policy</font>: the neural network outputs a vector of action probabilities (a probability distribution over actions) rather than a single deterministic action.
- REINFORCE samples its action from this distribution, so an agent that visits the same state twice may take a different action each time.
- REINFORCE is formulated over trajectories rather than episodes: maximizing the expected return over trajectories lets the method search for optimal policies in both episodic and continuing tasks.
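To make the sampling behaviour concrete, here is a minimal sketch using `torch.distributions.Categorical`. The probabilities `[0.7, 0.3]` are made up for illustration and do not come from a trained network.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)  # fixed seed so the demo is repeatable

# Hypothetical action probabilities for one state (two actions, as in CartPole).
probs = torch.tensor([0.7, 0.3])
dist = Categorical(probs=probs)

# Sample many actions for the *same* state: both actions show up,
# roughly in proportion to their probabilities.
samples = dist.sample((1000,))
counts = torch.bincount(samples, minlength=2)
freqs = counts.float() / samples.numel()

# Log-probability of one sampled action; REINFORCE uses it in the gradient.
action = dist.sample()
log_prob = dist.log_prob(action)
print(freqs, action.item(), log_prob.item())
```

A deterministic policy would always pick the most likely action here; sampling keeps some exploration of the less likely one.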

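Expected return:
$$U(\theta) = \sum_{\tau} P(\tau; \theta)R(\tau)$$ 
Its gradient, estimated from a single sampled trajectory of horizon $H$:
$$\nabla_{\theta}U(\theta) \approx \sum_{t=0}^{H}\nabla_{\theta} \log \pi_{\theta}(a_t | s_t) G_{t}$$ 
Gradient-ascent update with learning rate $\alpha$:
$$\theta \leftarrow \theta + \alpha \nabla_{\theta}U(\theta)$$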
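The gradient expression above weights each step by $G_t$. A minimal sketch of computing it follows, reading $G_t$ as the discounted return from step $t$ onward. That reading is an assumption: the notebook code further down weights every step with the full-trajectory return `R` instead. The helper name `rewards_to_go` is hypothetical.

```python
def rewards_to_go(rewards, gamma=1.0):
    """Return [G_0, ..., G_H] with G_t = sum over k >= t of gamma**(k-t) * r_k."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each G_t reuses G_{t+1}: G_t = r_t + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

example = rewards_to_go([1.0, 1.0, 1.0], gamma=0.5)
print(example)  # [1.75, 1.5, 1.0]
```

With `gamma=1.0` and a reward of 1 per step, as in CartPole, $G_t$ is simply the number of steps remaining in the trajectory.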

In [1]:
ENV = "CartPole-v1"
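# Average score over 100 episodes that counts as "solved".
# Note: 195 is the CartPole-v0 threshold; CartPole-v1 is registered with 475.
REWARD_MAX = 195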

In [2]:
import gym
import numpy as np
from collections import deque
import matplotlib.pyplot as plt
%matplotlib inline

import torch
torch.manual_seed(0) # set random seed
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical

In [3]:
env = gym.make(ENV)
# torch.has_mps is deprecated; query the MPS backend directly
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

In [4]:
obs_size = env.observation_space.shape[0] 
n_actions = env.action_space.n  

class Policy(nn.Module):
    def __init__(self, s_size=obs_size, h_size=16, a_size=n_actions):
        super(Policy, self).__init__()
        self.fc1 = nn.Linear(s_size, h_size)
        self.fc2 = nn.Linear(h_size, a_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.softmax(x, dim=1)
    
    def act(self, state):
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        probs = self.forward(state).cpu()
        m = Categorical(probs)
        action = m.sample()
        return action.item(), m.log_prob(action)

In [5]:
policy = Policy().to(device)
optimizer = optim.Adam(policy.parameters(), lr=1e-2)

def reinforce(n_episodes=1000, max_t=1000, gamma=1.0, print_every=100):
    scores_deque = deque(maxlen=100)
    scores = []
    for i_episode in range(1, n_episodes+1):
        saved_log_probs = []
        rewards = []
        state, _ = env.reset()
        for t in range(max_t):
            action, log_prob = policy.act(state)
            saved_log_probs.append(log_prob)
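            state, reward, terminated, truncated, info = env.step(action)
            rewards.append(reward)
            # stop on failure (terminated) or on the env's time limit (truncated)
            if terminated or truncated:
                break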
        scores_deque.append(sum(rewards))
        scores.append(sum(rewards))
        
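        # Discounted return of the whole trajectory: R = sum_i gamma^i * r_i
        discounts = [gamma**i for i in range(len(rewards))]
        R = sum(a*b for a, b in zip(discounts, rewards))
        
        # Every step's log-probability is weighted by the same total return R
        policy_loss = []
        for log_prob in saved_log_probs:
            policy_loss.append(-log_prob * R)
        # Each log_prob has shape (1,), so cat yields one tensor to sum
        policy_loss = torch.cat(policy_loss).sum()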
        
        optimizer.zero_grad()
        policy_loss.backward()
        optimizer.step()
        print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_deque)), end="")
        if i_episode % print_every == 0:
            print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_deque)))
            torch.save(policy.state_dict(), f'{ENV}.pth')
        if np.mean(scores_deque)>=REWARD_MAX:
            print('Environment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode-100, np.mean(scores_deque)))
            torch.save(policy.state_dict(), f'{ENV}.pth')
            break
        
    return scores
    
scores = reinforce()

Episode 100	Average Score: 20.25
Episode 200	Average Score: 46.92
Episode 300	Average Score: 76.51
Episode 400	Average Score: 85.48
Episode 500	Average Score: 106.72
Episode 600	Average Score: 119.34
Episode 700	Average Score: 71.195
Episode 800	Average Score: 87.87
Episode 900	Average Score: 68.45
Episode 1000	Average Score: 69.38


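* As the scores above show, training is noisy and the average score stays below `REWARD_MAX` for all 1000 episodes. Each trajectory is used for a single update and then discarded, so plain REINFORCE is very sample-inefficient and is rarely used in this form.
* There is no clear credit assignment. A trajectory may contain many good and bad actions, yet whether they are reinforced depends only on the total return of the trajectory.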

Motivation for PPO
* Because the process is Markov, the action taken at time step t can only affect future rewards, so past rewards should not contribute to that action's term in the policy gradient.
* The simplest way to reduce noise in the gradient is to sample more trajectories. With distributed computing, multiple trajectories can be collected in parallel, so it does not take much longer. The policy gradient is then estimated by averaging over all of them.
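Both ideas can be combined in a few lines. The sketch below is illustrative only: `fake_trajectory` is a hypothetical stand-in that uses random states and a reward of 1.0 per step instead of a real environment, the policy is a single linear layer, and no discounting is applied (gamma = 1).

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

torch.manual_seed(0)

# Tiny stand-in policy: 4 state features -> 2 action logits (CartPole-like sizes).
policy = nn.Linear(4, 2)

def fake_trajectory(length=5):
    """Hypothetical rollout: random states, reward 1.0 per step.
    A real version would step an environment instead."""
    log_probs, rewards = [], []
    for _ in range(length):
        state = torch.randn(4)
        dist = Categorical(logits=policy(state))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        rewards.append(1.0)
    return log_probs, rewards

def batch_loss(trajectories):
    """Average the per-trajectory policy-gradient loss over the batch."""
    losses = []
    for log_probs, rewards in trajectories:
        # Future rewards only: reverse, cumulative-sum, reverse back.
        future = torch.tensor(rewards).flip(0).cumsum(0).flip(0)
        losses.append(-(torch.stack(log_probs) * future).sum())
    return torch.stack(losses).mean()

trajectories = [fake_trajectory() for _ in range(8)]
loss = batch_loss(trajectories)
loss.backward()  # gradients are now averaged over the 8 trajectories
print(loss.item(), policy.weight.grad.shape)
```

In the notebook's training loop, the same pattern would mean collecting several episodes before each `optimizer.step()` rather than updating after every episode.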


