In REINFORCE we don't rely on an external critic. Instead, we use the full Monte-Carlo return as our feedback signal.
The parametrized policy πθ(a|s) is updated using the following policy-gradient estimate:
g(θ; St, At) = Gt ∇θ log πθ(At|St)
where Gt is the Monte-Carlo sampled return
Gt = Rt + γ Rt+1 + γ² Rt+2 + …
The sum is taken over all rewards up to the terminal state.
For more details, see section 13.3 of Sutton & Barto.
reinforce.py
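As a minimal sketch of the update above (not the author's `reinforce.py`, just an illustration): a tabular softmax policy on a hypothetical one-state, two-action bandit, where we compute the Monte-Carlo returns Gt backwards through the episode and then step the logits along Gt ∇θ log πθ(At|St).

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over action logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, states, actions, rewards, gamma=0.99, lr=0.1):
    """One REINFORCE update from a single episode.

    theta: (n_states, n_actions) logits of a tabular softmax policy.
    """
    # Monte-Carlo returns: Gt = Rt + γ Rt+1 + γ² Rt+2 + …
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    # Policy-gradient step: theta[s] += lr * Gt * ∇θ log πθ(At|St)
    for s, a, G in zip(states, actions, returns):
        probs = softmax(theta[s])
        grad_log = -probs          # ∇θ log softmax, part 1
        grad_log[a] += 1.0         # indicator for the taken action
        theta[s] += lr * G * grad_log
    return theta

# Toy example: two-armed bandit, action 0 pays 1, action 1 pays 0.
rng = np.random.default_rng(0)
theta = np.zeros((1, 2))
for _ in range(500):
    a = rng.choice(2, p=softmax(theta[0]))
    r = 1.0 if a == 0 else 0.0
    theta = reinforce_update(theta, [0], [a], [r])
```

After training, the policy concentrates its probability mass on the rewarding action, since only episodes with a positive return Gt push the logits.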