### Vanilla Policy Gradient ###

In this notebook we implement vanilla policy gradient (REINFORCE), a classic policy gradient algorithm for reinforcement learning. The goal of reinforcement learning is for an agent to learn to act in a dynamic environment so as to maximize its expected cumulative reward over a time horizon. Policy gradient methods address the control problem by directly learning the policy $\pi: \mathcal{S} \rightarrow \mathcal{A}$ from the rewards observed while interacting with the environment.



Formally, define a trajectory $\tau$ as a tuple $(s_0, a_0, r_0, s_1, a_1, \dots, r_T)$ denoting the sequence of states, actions, and rewards observed over one episode of interaction with the agent's environment, and let $R(\tau) = \sum_{t=0}^{T} r_t$ denote the finite-horizon return, i.e. the cumulative sum of rewards. Our goal is to maximize the _expected_ finite-horizon return, where the expectation is over trajectories sampled from the stochastic policy $\pi_{\theta}$ (here $\theta$ denotes the parameters of the policy $\pi$), i.e.

\begin{equation}
\max_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}} [R(\tau)]
\end{equation}
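
As a concrete illustration of the objective, the return $R(\tau)$ of one trajectory is the sum of its rewards, and $J$ can be approximated by averaging returns over sampled trajectories. The sketch below assumes rewards are stored as plain Python lists; the function names and the reward values are made up for illustration.

```python
def finite_horizon_return(rewards):
    """Undiscounted return R(tau): the sum of the rewards r_0, ..., r_T."""
    return float(sum(rewards))


def estimate_expected_return(reward_sequences):
    """Monte Carlo estimate of J: the mean of R(tau) over sampled trajectories."""
    returns = [finite_horizon_return(rewards) for rewards in reward_sequences]
    return sum(returns) / len(returns)


# Hypothetical reward sequences from three episodes (made-up numbers).
episodes = [[1.0, 1.0, 1.0], [1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
print(estimate_expected_return(episodes))  # 3.0
```

The estimate becomes more accurate as more trajectories are sampled from the current policy.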

In order to optimize $J$ using gradient ascent, we need a computable form of its gradient. 

Skipping the derivation for now, the gradient of the expected return is

\begin{equation}
    \nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=0}^T \nabla_{\theta} \log \pi_{\theta}(a_t|s_t) R(\tau) \right]
\end{equation}

The gradient of each action's log probability is weighted by the return $R(\tau)$ of the trajectory in which the action was taken. This has an intuitive interpretation: the gradient encourages us to increase the probability of actions that appear in high-return trajectories.

Conveniently, $\nabla_{\theta} J$ has the form of an expectation. Because of this, we can estimate it with a sample mean over trajectories collected by running the policy in the environment, i.e.

\begin{equation}
\hat{g} = \dfrac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^T \nabla_{\theta} \log \pi_{\theta}(a_t|s_t) R(\tau)
\end{equation}

where $\mathcal{D}$ is a dataset of trajectories collected by running the current policy $\pi_{\theta}$.
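
In an automatic differentiation framework, one common way to obtain $\hat{g}$ is to build a scalar "surrogate loss" whose gradient equals $-\hat{g}$ and call backpropagation on it. The following is a minimal sketch under stated assumptions, not a definitive implementation: the linear softmax policy, the `surrogate_loss` helper, and the two made-up trajectories are all hypothetical stand-ins chosen for illustration.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

torch.manual_seed(0)

# Hypothetical stand-in policy: 4-dimensional states, 2 discrete actions.
policy = nn.Linear(4, 2)


def surrogate_loss(policy, trajectories):
    """Return a scalar whose gradient w.r.t. the policy parameters is -g_hat.

    Each trajectory is a (states, actions, rewards) tuple of tensors.
    """
    per_trajectory = []
    for states, actions, rewards in trajectories:
        # log pi_theta(a_t | s_t) for every step of the trajectory.
        log_probs = Categorical(logits=policy(states)).log_prob(actions)
        trajectory_return = rewards.sum()  # R(tau), a constant w.r.t. theta
        per_trajectory.append(log_probs.sum() * trajectory_return)
    # Average over the dataset D; negate because optimizers minimize.
    return -torch.stack(per_trajectory).mean()


# Made-up dataset of two short trajectories (random states, arbitrary actions).
dataset = [
    (torch.randn(3, 4), torch.tensor([0, 1, 1]), torch.ones(3)),
    (torch.randn(2, 4), torch.tensor([1, 0]), torch.ones(2)),
]

loss = surrogate_loss(policy, dataset)
loss.backward()  # policy.weight.grad now holds -g_hat for the weights
```

Taking a gradient descent step on this loss is then equivalent to a gradient ascent step on the estimated objective. The value of the loss itself is not a meaningful performance measure; only its gradient matters.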

### Policy Network ###

The policy is represented as a multi-layer perceptron so that we can learn to act in environments with high-dimensional states and/or continuous action spaces.

In [6]:
from collections import OrderedDict
import torch
import numpy as np
import torch.nn as nn

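def build_mlp(input_size, output_size, n_layers, size):
    """Build an MLP: an input layer, n_layers hidden layers of width `size`, and a linear output layer."""
    modules = OrderedDict()
    # Input layer maps observations to the hidden width.
    modules['Linear_Input'] = nn.Linear(input_size, size)
    modules['ReLU_Input'] = nn.ReLU()
    # n_layers additional hidden layers, each followed by a ReLU.
    for i in range(n_layers):
        modules['Linear_' + str(i)] = nn.Linear(size, size)
        modules['ReLU_' + str(i)] = nn.ReLU()
    # Final linear layer with no activation, so the outputs are unnormalized scores.
    modules['Linear_Output'] = nn.Linear(size, output_size)
    return nn.Sequential(modules)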

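# CartPole: 4-dimensional observations and 2 discrete actions (push left, push right)
observation_dim        = 4
action_dim             = 2
n_layers               = 1
layer_size             = 64

# One output (logit) per action, assuming a categorical policy over CartPole's two actions.
policy_network = build_mlp(observation_dim, action_dim, n_layers, layer_size)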

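
To act with such a network, its outputs have to be turned into a distribution over actions. A minimal sketch follows, assuming the outputs are treated as the logits of a categorical distribution over two discrete actions; that interpretation, the stand-in `policy_net`, the `select_action` helper, and the example observation are assumptions made for illustration rather than part of the code above.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

torch.manual_seed(0)

# Stand-in for an MLP policy, assuming 4-dimensional observations
# and 2 discrete actions (one output logit per action).
policy_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))


def select_action(policy_net, observation):
    """Sample an action from the policy and return it with its log-probability."""
    obs = torch.as_tensor(observation, dtype=torch.float32)
    dist = Categorical(logits=policy_net(obs))
    action = dist.sample()
    # The log-probability stays attached to the graph so it can be
    # differentiated later when forming the policy gradient estimate.
    return action.item(), dist.log_prob(action)


# Made-up observation vector.
action, log_prob = select_action(policy_net, [0.0, 0.1, -0.05, 0.2])
```

Sampling, rather than always taking the highest-scoring action, keeps the policy stochastic, which matches the expectation over $\tau \sim \pi_{\theta}$ in the objective.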