In REINFORCE we don't rely on an external critic. Instead, we use the full Monte-Carlo return as our feedback signal.
The parametrized policy πθ(a|s) is updated using the following policy-gradient estimate:
g(θ; St, At) = Gt ∇θ log πθ(At|St)
where Gt is the Monte-Carlo sampled return
Gt = Rt + γ Rt+1 + γ² Rt+2 + …
The sum is taken over all rewards up to the terminal state.
For more details, see section 13.3 of Sutton & Barto.
reinforce.py
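As a minimal sketch of the update above (not the author's `reinforce.py`, just an illustration): a tabular softmax policy on a hypothetical one-state, two-action bandit, where we compute the Monte-Carlo returns Gt backwards through the episode and then step the logits along Gt ∇θ log πθ(At|St).

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over action logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, states, actions, rewards, gamma=0.99, lr=0.1):
    """One REINFORCE update from a single episode.

    theta: (n_states, n_actions) logits of a tabular softmax policy.
    """
    # Monte-Carlo returns: Gt = Rt + γ Rt+1 + γ² Rt+2 + …
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    # Policy-gradient step: theta[s] += lr * Gt * ∇θ log πθ(At|St)
    for s, a, G in zip(states, actions, returns):
        probs = softmax(theta[s])
        grad_log = -probs          # ∇θ log softmax, part 1
        grad_log[a] += 1.0         # indicator for the taken action
        theta[s] += lr * G * grad_log
    return theta

# Toy example: two-armed bandit, action 0 pays 1, action 1 pays 0.
rng = np.random.default_rng(0)
theta = np.zeros((1, 2))
for _ in range(500):
    a = rng.choice(2, p=softmax(theta[0]))
    r = 1.0 if a == 0 else 0.0
    theta = reinforce_update(theta, [0], [a], [r])
```

After training, the policy concentrates its probability mass on the rewarding action, since only episodes with a positive return Gt push the logits.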