### Policy Gradients

### University of Virginia
### Reinforcement Learning
### Last updated: October 9, 2023

---



### SOURCES 

- Reinforcement Learning, RS Sutton & AG Barto, 2nd edition. Chapter 13

### LEARNING OUTCOMES

- Understand policy gradient methods
- Understand a policy gradient algorithm: REINFORCE
- Understand extensions of REINFORCE: why is a baseline useful
- Understand why a critic is useful

### CONCEPTS

- policy gradient methods
- parametrizing the policy
- baseline
- actor-critic methods

### ORDERING

To follow Deep Q-Networks


---  

### Introducing Policy-Based Methods

Most of our work has been with value functions (e.g., action-value estimates). 

This is an indirect yet powerful way to tackle sequential prediction and control problems

Now we do something different: **directly model the policy and work to improve it**

What does this buy us?

- arguably more principled since policy methods directly optimize the policy parameters  
  this is more efficient in some cases

- allows us to model continuous action spaces

- can learn genuinely stochastic policies  
  in some cases the optimal policy mixes actions with specific probabilities (e.g., whether or not to bluff in poker)


### Implementation of Policy-Based Methods

**Policy is parametrized**

Denote the *policy parameter* vector by $\boldsymbol{\theta}$

The policy is parametrized as: 

$\pi(a|s,\boldsymbol{\theta}) = P(A_t=a | S_t=s, \boldsymbol{\theta_t}=\boldsymbol{\theta})$

The parametrization is flexible, but $\pi(a|s,\boldsymbol{\theta})$ must be differentiable with respect to $\boldsymbol{\theta}$

The goal is to learn how to update the components of $\boldsymbol{\theta}$ to improve a performance measure $J(\boldsymbol{\theta})$

**Gradient Ascent to make improvements**

We will follow the direction of the gradient to improve $\boldsymbol{\theta}$

The update equation has form:

$\boldsymbol{\theta_{t+1}} = \boldsymbol{\theta_t} + \alpha \widehat{\nabla J(\boldsymbol{\theta_t})}$

where 

- $\alpha$ is the step size (learning rate)

- $\widehat{\nabla J(\boldsymbol{\theta_t})}$ is a stochastic estimate of the gradient of the performance measure with respect to the parameters. 

In expectation, $\widehat{\nabla J(\boldsymbol{\theta_t})}$ should approximate the true gradient $\nabla J(\boldsymbol{\theta_t})$.
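The update rule can be illustrated with a minimal sketch of stochastic gradient ascent. The objective $J(\theta) = -(\theta - 3)^2$, the noise level, and the function name `noisy_gradient` are assumptions chosen only for illustration; they are not part of any reinforcement learning problem.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_gradient(theta):
    """Unbiased but noisy estimate of dJ/dtheta for the toy J(theta) = -(theta - 3)^2."""
    true_gradient = -2.0 * (theta - 3.0)
    return true_gradient + rng.normal(0.0, 1.0)  # zero-mean noise keeps the estimate unbiased

theta, alpha = 0.0, 0.05
for _ in range(2000):
    theta = theta + alpha * noisy_gradient(theta)  # step in the direction of the estimate

print(theta)  # ends near the maximizer theta = 3, with some residual noise
```

Each individual step can point the wrong way, but because the estimate is correct on average, the parameter drifts toward the maximizer.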

It is common to use a softmax distribution for the policy:

$\pi(a|s,\boldsymbol{\theta}) = \frac{\exp{h(s,a,\boldsymbol{\theta})}}{\sum_b \exp{h(s,b,\boldsymbol{\theta})}}$

where $h(s,a,\boldsymbol{\theta})$ is a parametrized numerical preference given state $s$ and action $a$.

A higher preference leads to a higher probability of selecting that action.

This is where modeling comes in, as $h(s,a,\boldsymbol{\theta})$ could be any sort of model (linear, deep neural network, ...).
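As a rough sketch, a softmax policy with linear preferences could look like the following. The linear form $h(s,a,\boldsymbol{\theta}) = \boldsymbol{\theta}^\top \mathbf{x}(s,a)$, the feature matrix, and the function name `softmax_policy` are assumptions made for illustration; any differentiable model could stand in for the preferences.

```python
import numpy as np

def softmax_policy(theta, features):
    """Return action probabilities for one state.

    theta:    parameter vector, shape (d,)
    features: hypothetical feature matrix with one row x(s, a) per action,
              shape (num_actions, d)
    """
    prefs = features @ theta           # h(s, a, theta) for every action a
    prefs = prefs - prefs.max()        # shift for numerical stability (probabilities unchanged)
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

theta = np.array([1.0, -1.0])
features = np.array([[1.0, 0.0],      # action 0 -> preference  1
                     [0.0, 1.0],      # action 1 -> preference -1
                     [1.0, 1.0]])     # action 2 -> preference  0
probs = softmax_policy(theta, features)
```

The probabilities sum to one and are ordered the same way as the preferences, so action 0 is the most likely here and action 1 the least.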

Methods following this approach are called policy gradient (PG) methods.

Methods that learn approximations to policy and value functions are called *actor-critic methods*.  
*Actor*: the learned policy  
*Critic*: the learned value function

---

### Our First Policy-Gradient Algorithm: REINFORCE

REINFORCE is a Monte Carlo Policy Gradient algorithm

We need a way to sample such that the expectation of the sample gradient is proportional to the true gradient of the performance measure.

The policy gradient theorem (see Sutton & Barto Ch 13 for details) gives an expression to obtain unbiased samples.

The update equation (excludes discounting), which uses stochastic gradient ascent, has form:

$\boldsymbol{\theta_{t+1}} = \boldsymbol{\theta_t} + \alpha G_t \frac{\nabla \pi(A_t|S_t, \boldsymbol{\theta_t})}{\pi(A_t|S_t, \boldsymbol{\theta_t})} $

where $G_t$ is the return from time step $t$ until the end of the episode

Since each episode must run to termination before $G_t$ can be computed, this is a Monte Carlo method.

**The intuition behind the equation:**

- the parameter update is proportional to the product of two factors:  
  1) the return $G_t$  
  2) the gradient of the probability of taking the action, divided by the probability of taking the action

- a larger return $G_t$ produces a larger step in the direction that makes action $A_t$ more likely
- dividing by the probability of $A_t$ shrinks the update for actions that are already selected often; otherwise frequently selected actions would be updated more often and gain an unfair advantage
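A short worked identity may help here. By the chain rule, the ratio in the update is the gradient of the log-probability:

$\frac{\nabla \pi(A_t|S_t, \boldsymbol{\theta})}{\pi(A_t|S_t, \boldsymbol{\theta})} = \nabla \ln \pi(A_t|S_t, \boldsymbol{\theta})$

As an illustration, assume a softmax policy with linear preferences $h(s,a,\boldsymbol{\theta}) = \boldsymbol{\theta}^\top \mathbf{x}(s,a)$, where the feature vector $\mathbf{x}(s,a)$ is an assumption introduced only for this example. Then:

$\nabla \ln \pi(a|s, \boldsymbol{\theta}) = \mathbf{x}(s,a) - \sum_b \pi(b|s, \boldsymbol{\theta}) \mathbf{x}(s,b)$

That is, the update moves $\boldsymbol{\theta}$ toward the features of the action taken and away from the policy's average features, scaled by the return.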


---

#### REINFORCE Pseudocode

Inputs: 
- differentiable policy $\pi(a|s,\boldsymbol{\theta})$
- step size $\alpha$
- policy parameter $\boldsymbol{\theta}$

Initialize $\boldsymbol{\theta}$

Loop for each episode:  
- Generate an episode $S_0,A_0,R_1,...,S_{T-1},A_{T-1},R_T$  
- Loop for each time step $t=0,1,...,T-1$:

    - $G = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$  (compute the discounted return from time step $t$)

    - $\boldsymbol{\theta} = \boldsymbol{\theta} + \alpha \gamma^t G \frac{\nabla \pi(A_t|S_t, \boldsymbol{\theta})}{\pi(A_t|S_t, \boldsymbol{\theta})}$ (update the parameter vector using stochastic gradient ascent)

where the parameter update equation uses $\gamma^t$ for discounting
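The pseudocode can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the four-state chain environment, the tabular softmax policy, the reward of -1 per step, the step cap `MAX_STEPS`, and all function names are assumptions made for this example. It uses the fact that $\nabla \pi / \pi = \nabla \ln \pi$, which for a tabular softmax equals a one-hot vector for the chosen action minus the action probabilities.

```python
import numpy as np

N_STATES, N_ACTIONS, MAX_STEPS = 4, 2, 50  # toy chain; state 3 is terminal

def policy(theta, s):
    """Tabular softmax policy: theta[s] holds one preference per action."""
    prefs = theta[s] - theta[s].max()
    e = np.exp(prefs)
    return e / e.sum()

def step(s, a):
    """Action 1 moves right, action 0 moves left (or stays at state 0)."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s_next, -1.0, s_next == N_STATES - 1

def generate_episode(theta, rng):
    states, actions, rewards = [], [], []
    s, done = 0, False
    while not done and len(states) < MAX_STEPS:
        a = rng.choice(N_ACTIONS, p=policy(theta, s))
        s_next, r, done = step(s, a)
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = s_next
    return states, actions, rewards

def reinforce(num_episodes=1000, alpha=0.01, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(num_episodes):
        states, actions, rewards = generate_episode(theta, rng)
        # Compute the return G_t for every time step by sweeping backward.
        G, returns = 0.0, []
        for r in reversed(rewards):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for t, (s, a, G_t) in enumerate(zip(states, actions, returns)):
            grad_log_pi = -policy(theta, s)  # gradient of ln pi w.r.t. theta[s]
            grad_log_pi[a] += 1.0
            theta[s] += alpha * (gamma ** t) * G_t * grad_log_pi
    return theta

theta = reinforce()
print([round(policy(theta, s)[1], 3) for s in range(N_STATES - 1)])  # P(move right)
```

Since every step costs -1, shorter episodes have higher returns, so the learned policy should come to prefer moving right in each non-terminal state.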

---

#### REINFORCE with Baseline

REINFORCE can have high variance and converge slowly.

The update can be generalized to compare the return $G_t$ to a baseline $b(s)$, which can be any function of the state as long as it does not depend on the action

The baseline can reduce variance and speed up learning, without introducing bias

Here is the update rule (excludes discounting):

$\boldsymbol{\theta_{t+1}} = \boldsymbol{\theta_t} + \alpha (G_t - b(S_t)) \frac{\nabla \pi(A_t|S_t, \boldsymbol{\theta_t})}{\pi(A_t|S_t, \boldsymbol{\theta_t})} $

A useful baseline is an estimate of the state value $\hat{v}(S_t, \textbf{w})$
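To see why subtracting a baseline leaves the expected update unchanged, note that for any $b(s)$ that does not depend on the action:

$\sum_a b(s) \nabla \pi(a|s, \boldsymbol{\theta}) = b(s) \nabla \sum_a \pi(a|s, \boldsymbol{\theta}) = b(s) \nabla 1 = 0$

The baseline term therefore contributes nothing to the expected gradient; it only changes the variance of the estimate.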

---

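The algorithm is very similar to REINFORCE, with these differences:
- introduces a learned state-value function $\hat{v}(s, \textbf{w})$ as the baseline
- uses separate step sizes for the policy update and the value-function update
- computes the baseline and subtracts it from the return at each step
- adds an update equation for the weights $\textbf{w}$ of the state-value function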

![reinforce](./reinforce_w_baseline0.png)
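A minimal sketch of the per-episode updates for REINFORCE with baseline, assuming a tabular softmax policy (`theta[s]` holds one preference per action) and a tabular value estimate (`w[s]` is the value of state `s`). The function names, default step sizes, and the two-step example episode are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(prefs):
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def reinforce_with_baseline_update(theta, w, states, actions, rewards,
                                   alpha_theta=0.1, alpha_w=0.1, gamma=1.0):
    """Apply one episode of updates; rewards[t] is the reward received after step t."""
    T = len(rewards)
    for t in range(T):
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
        s, a = states[t], actions[t]
        delta = G - w[s]                 # return minus baseline
        w[s] += alpha_w * delta          # move the value estimate toward G
        grad_log_pi = -softmax(theta[s])
        grad_log_pi[a] += 1.0            # one-hot(a) minus pi(.|s)
        theta[s] += alpha_theta * (gamma ** t) * delta * grad_log_pi
    return theta, w

theta = np.zeros((2, 2))
w = np.zeros(2)
theta, w = reinforce_with_baseline_update(
    theta, w, states=[0, 1], actions=[1, 0], rewards=[0.0, 1.0])
```

In this example both steps have return 1 and a baseline of 0, so each value estimate moves to 0.1 and each chosen action's preference rises by 0.05 while the other falls by 0.05.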

#### One-Step Actor-Critic

REINFORCE with baseline uses the state-value estimate only for the first state of each transition, where it serves as a baseline.

In actor-critic methods, the state-value function is also applied to the second state of the transition.  
Adding the discounted value estimate of that next state to the reward gives a one-step return, which is used to assess the action (the value function acts as the *critic*).

The overall policy-gradient method is called an *actor-critic* method.

This one-step method replaces the full return of REINFORCE with the one-step return. It is fully online and incremental.

![ac_one_step](./actor_critic_one_step0.png)
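A single one-step actor-critic update might be sketched as below, again assuming a tabular softmax actor (`theta`) and a tabular critic (`w`). The function name, argument names, and example numbers are hypothetical; `discount` tracks the accumulated factor $\gamma^t$ and should start at 1 in each episode.

```python
import numpy as np

def softmax(prefs):
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def actor_critic_step(theta, w, s, a, r, s_next, done, discount,
                      alpha_theta=0.1, alpha_w=0.1, gamma=0.9):
    """Update actor and critic in place from one transition (s, a, r, s_next)."""
    v_next = 0.0 if done else w[s_next]        # value of a terminal state is 0
    delta = r + gamma * v_next - w[s]          # one-step return minus current estimate
    w[s] += alpha_w * delta                    # critic update
    grad_log_pi = -softmax(theta[s])
    grad_log_pi[a] += 1.0                      # one-hot(a) minus pi(.|s)
    theta[s] += alpha_theta * discount * delta * grad_log_pi  # actor update
    return delta, discount * gamma

theta = np.zeros((2, 2))
w = np.array([0.0, 1.0])
delta, discount = actor_critic_step(
    theta, w, s=0, a=1, r=0.5, s_next=1, done=False, discount=1.0)
```

Unlike REINFORCE, this update needs only the current transition, so it can be applied at every step while the episode is still running.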

**Going Deeper**

We will dive further into Actor-Critic and other extensions next.  

If you are interested in policy gradients applied to precision medicine:

*Precision medicine as a control problem: Using simulation and deep reinforcement learning to discover adaptive, personalized multi-cytokine therapy for sepsis.* 

Brenden K. Petersen, Jiachen Yang, Will S. Grathwohl, Chase Cockrell, Claudio Santiago, Gary An, Daniel M. Faissol

https://arxiv.org/abs/1802.10440

---