REINFORCE

In REINFORCE we don't rely on an external critic. Instead, we use the full Monte-Carlo return as our feedback signal.

The parametrized policy $\pi_\theta(a|s)$ is updated using the following policy gradient:


$$ g(\theta; S_t, A_t) = G_t \, \nabla_\theta \log \pi_\theta(A_t | S_t) $$

where $G_t$ is the Monte-Carlo sampled return:


$$ G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots $$

The sum is taken over all rewards up to the terminal state.
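Because $G_t = R_t + \gamma G_{t+1}$, the returns can be computed in a single backward pass over the episode's rewards. Here is a minimal sketch; the function name and the discount factor $\gamma = 0.99$ are illustrative assumptions, not taken from the accompanying script::

    def compute_returns(rewards, gamma=0.99):
        """Compute G_t = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ... for each t."""
        returns = [0.0] * len(rewards)
        g = 0.0
        # Walk backwards so each G_t reuses the already-computed G_{t+1}.
        for t in reversed(range(len(rewards))):
            g = rewards[t] + gamma * g
            returns[t] = g
        return returns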

For more details, see Section 13.3 of Sutton & Barto, Reinforcement Learning: An Introduction.
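Putting the two pieces together, a single REINFORCE update ascends $\sum_t G_t \, \nabla_\theta \log \pi_\theta(A_t | S_t)$ over one episode. The sketch below uses PyTorch with a small discrete-action policy network; the layer sizes, optimizer, and learning rate are illustrative assumptions, not the configuration used in reinforce.py::

    import torch
    import torch.nn as nn

    # Hypothetical policy for 4-dimensional observations and 2 discrete actions.
    policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

    def reinforce_update(states, actions, returns):
        """One update: gradient ascent on sum_t G_t * log pi_theta(A_t | S_t)."""
        logits = policy(torch.as_tensor(states, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        log_probs = dist.log_prob(torch.as_tensor(actions))
        # Minimizing the negative objective is equivalent to ascending the
        # policy gradient g(theta; S_t, A_t) = G_t * grad log pi(A_t | S_t).
        loss = -(torch.as_tensor(returns, dtype=torch.float32) * log_probs).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()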


reinforce.py

Open in Google Colab