# Policy Gradient

> This is the summary of lecture "Prediction and Control with Function Approximation" from Coursera.

- toc: true 
- badges: true
- comments: true
- author: Chanseok Kang
- categories: [Python, Coursera, Reinforcement_Learning]
- image: 

## Learning Policies Directly

### Parameterizing Policy Directly

![](image/parameter_policy.png)

### Constraints on the Policy Parameterization

$\pi(a \vert s, \theta) \ge 0 \quad \text{for all } a \in \mathcal{A} \text{ and } s \in \mathcal{S} $

$ \sum\limits_{a \in \mathcal{A}} \pi( a \vert s, \theta) = 1 \quad \text{ for all } s \in \mathcal{S} $

### The softmax Policy Parameterization

$ \pi(a \vert s, \theta) \doteq \frac{e^{h(s, a, \theta)}}{ \sum_{b \in \mathcal{A}}e^{h(s, b, \theta)}} $

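$h(s, a, \theta)$ is called the action preference. Exponentiating it makes every probability positive, and dividing by the sum over actions makes the probabilities add up to one.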
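To make the definition concrete, here is a minimal sketch of the softmax over action preferences for a single state. The preference values are made up for illustration, and `softmax_policy` is a hypothetical helper name, not something from the lecture.

```python
import math

def softmax_policy(preferences):
    """Map action preferences h(s, a, theta) to probabilities pi(a | s, theta)."""
    # Subtracting the largest preference does not change the result,
    # but it keeps exp() from overflowing for large preferences.
    largest = max(preferences)
    exps = [math.exp(h - largest) for h in preferences]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical preferences for three actions in one state.
prefs = [2.0, 1.0, -1.0]
probs = softmax_policy(prefs)
print(probs)
```

The output satisfies both constraints above: every entry is positive and the entries sum to one. Adding the same constant to every preference leaves the policy unchanged, which is why only the differences between preferences matter.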

### Action preferences are not action values

![](image/action_preference.png)

## The Objective for Learning Policies

### Formalizing the Goal as an objective

- Episodic

$ G_t = \sum\limits_{k=t+1}^{T} R_k $

- Continuing

$\begin{aligned} G_t &= \sum\limits_{k=0}^{\infty} \gamma^k R_{t+k+1} \text{ (original, discounted) } \\
 G_t &= \sum\limits_{k=0}^{\infty} \big( R_{t+k+1} - r(\pi) \big) \text{ (applying average reward) } \end{aligned}$
 
### Optimizing the Average Reward Objective

$\nabla r(\pi) = \nabla \sum_s \mu(s) \sum_a \pi(a \vert s, \theta) \sum_{s', r} p(s', r \vert s, a) r $
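The quantity being differentiated is the average reward $r(\pi) = \sum_s \mu(s) \sum_a \pi(a \vert s, \theta) \sum_{s', r} p(s', r \vert s, a) r$. As a rough illustration, the sketch below evaluates it for a made-up two-state, two-action MDP. All numbers, and the names `P`, `R`, `pi`, `stationary_distribution` and `average_reward`, are assumptions for this example. `R[s][a]` stands for the inner sum $\sum_{s', r} p(s', r \vert s, a) r$.

```python
# Hypothetical MDP: P[s][a][s2] is the probability of moving from s to s2
# under action a, and R[s][a] is the expected immediate reward.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]
# Hypothetical policy: pi[s][a] = pi(a | s).
pi = [[0.5, 0.5],
      [0.3, 0.7]]

def stationary_distribution(P, pi, iters=1000):
    """Approximate mu by repeatedly applying the state transition under pi."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        new_mu = [0.0] * n
        for s in range(n):
            for a in range(len(pi[s])):
                for s2 in range(n):
                    new_mu[s2] += mu[s] * pi[s][a] * P[s][a][s2]
        mu = new_mu
    return mu

def average_reward(P, R, pi):
    """r(pi) = sum_s mu(s) sum_a pi(a|s) * expected reward of (s, a)."""
    mu = stationary_distribution(P, pi)
    return sum(mu[s] * pi[s][a] * R[s][a]
               for s in range(len(P)) for a in range(len(pi[s])))

print(stationary_distribution(P, pi), average_reward(P, R, pi))
```

Changing any entry of `pi` changes both the inner sums and `mu`, because `mu` is recomputed from the policy.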

Methods that adjust $\theta$ by following this gradient are called **policy-gradient** methods.

### The Challenge of Policy Gradient Method

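$\mu(s)$ depends on $\theta$, so the gradient cannot simply be moved inside the sum over states. Compare this with the value error objective, where $\mu(s)$ does not depend on $w$ and the gradient passes straight through the sum: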

$\begin{aligned} \nabla_w \overline{VE} &= \nabla_w \sum_s \mu(s)[v_{\pi}(s) - \hat{v}(s, w)]^2 \\ &= \sum_s \mu(s) \nabla_w [v_{\pi}(s) - \hat{v}(s, w)]^2 \end{aligned}$

## The Policy Gradient Theorem

### The Gradient of the Objective

$\text{Product Rule: } \nabla(f(x) g(x)) = \nabla f(x)g(x) + f(x) \nabla g(x) $

$\begin{aligned} \nabla r(\pi) &= \nabla \sum_{s} \mu(s) \sum_a \pi(a \vert s, \theta) \sum_{s', r}p(s', r \vert s, a) r \\ 
 &= \sum_s \nabla \mu(s) \sum_a \pi(a \vert s, \theta) \sum_{s', r}p(s', r \vert s, a) r \\ &+ \sum_s \mu(s) \nabla \sum_a \pi(a \vert s, \theta) \sum_{s', r} p(s', r \vert s, a) r \end{aligned}$
 
### Policy Gradient Theorem

$\nabla r(\pi) = \sum_s \mu(s) \sum_a \nabla \pi(a \vert s, \theta) q_{\pi}(s, a) $

## Estimating the policy gradient

### Getting Stochastic samples of the gradient

$\nabla r(\pi) = \sum\limits_s \mu(s) \sum\limits_a \nabla \pi (a \vert s, \theta) q_{\pi}(s, a) $

$ \theta_{t+1} \doteq \theta_t + \alpha \sum\limits_a \nabla \pi(a \vert S_t, \theta_t) q_{\pi}(S_t, a) $

### Unbiasedness of the stochastic samples

$ \begin{aligned} \nabla r(\pi) &= \sum\limits_s \mu(s) \sum\limits_a \nabla \pi(a \vert s, \theta) q_{\pi}(s, a) \\
 &= \mathbb{E}_{\mu} [ \sum\limits_a \nabla \pi(a \vert S, \theta) q_{\pi}(S, a) ] \end{aligned} $
 
$\mu$ is the stationary distribution for $\pi$, which reflects state visitation under $\pi$. By computing the gradient from a state $S_t$ that is sampled from $\mu$, we get an unbiased estimate of this expectation.

### Getting Stochastic Samples with one action

$\begin{aligned} \sum\limits_a \nabla \pi (a \vert S, \theta) q_{\pi}(S, a) &= 
 \sum\limits_a \pi(a \vert S, \theta) \frac{1}{\pi(a \vert S, \theta)} \nabla \pi(a \vert S, \theta) q_{\pi}(S, a) \\
 &= \mathbb{E}_{\pi}[\frac{\nabla \pi(A \vert S, \theta)}{\pi(A \vert S, \theta)} q_{\pi}(S, A) ] \end{aligned} $
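The identity above says that a single sampled action gives an unbiased estimate of the full sum over actions. The sketch below checks this numerically for one state. It assumes a tabular softmax policy (one preference per action, so $\nabla \ln \pi(a) = e_a - \pi$), and the preferences and action values are made up.

```python
import math
import random

def softmax(prefs):
    largest = max(prefs)
    exps = [math.exp(h - largest) for h in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_pi(probs, a):
    # Tabular softmax: d ln pi(a) / d theta_b = 1[a == b] - pi(b).
    return [(1.0 if b == a else 0.0) - probs[b] for b in range(len(probs))]

theta = [0.5, -0.2, 0.1]   # hypothetical preferences for one state
q = [1.0, 2.0, 0.5]        # hypothetical action values q_pi(S, a)
probs = softmax(theta)
n = len(probs)

# Exact value: sum_a grad pi(a) q(a) = sum_a pi(a) grad ln pi(a) q(a).
exact = [sum(probs[a] * grad_log_pi(probs, a)[b] * q[a] for a in range(n))
         for b in range(n)]

# Monte Carlo value: average of grad ln pi(A) q(A) with A sampled from pi.
random.seed(0)
n_samples = 200_000
estimate = [0.0] * n
for a in random.choices(range(n), weights=probs, k=n_samples):
    g = grad_log_pi(probs, a)
    for b in range(n):
        estimate[b] += g[b] * q[a] / n_samples

print(exact)
print(estimate)
```

The two vectors agree up to sampling noise, which shrinks as `n_samples` grows.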
 
### Stochastic Gradient Ascent for policy parameters

$ \begin{aligned} \theta_{t+1} &\doteq \theta_t + \alpha \frac{\nabla \pi (A_t \vert S_t, \theta_t)}{\pi(A_t \vert S_t, \theta_t)} q_{\pi}(S_t, A_t) \\ 
 &= \theta_t + \alpha \nabla \ln \pi(A_t \vert S_t, \theta_t) q_{\pi}(S_t, A_t) \end{aligned} $
 
Note that,

$\nabla \ln \big(f(x)\big) = \frac{\nabla f(x)}{f(x)} $

### Computing the update

$ \theta_{t+1} \doteq \theta_t + \alpha \nabla \ln \pi(A_t \vert S_t, \theta_t) q_{\pi}(S_t, A_t) $

## Actor-Critic Algorithm

### Approximating the Action Value in the policy update

Using one-step bootstrapping,

$ \begin{aligned} \theta_{t+1} &\doteq \theta_t + \alpha \nabla \ln \pi(A_t \vert S_t, \theta_t) q_{\pi}(S_t, A_t) \\
 &\approx \theta_t + \alpha \nabla \ln \pi(A_t \vert S_t, \theta_t)[ R_{t+1} - \bar{R} + \hat{v}(S_{t+1}, w)] \end{aligned} $
 
![ac](image/actor_critic.png)

### Subtracting the current state's value estimate

$ \theta_{t+1} \doteq \theta_t + \alpha \nabla \ln \pi(A_t \vert S_t, \theta_t) [ R_{t+1} - \bar{R} + \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w)] $

$R_{t+1} - \bar{R} + \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w)$ is the TD error $\delta_t$. Subtracting $\hat{v}(S_t, w)$ does not affect the expected update.

### Adding a baseline

$\mathbb{E}_{\pi}[ \nabla \ln \pi(A_t \vert S_t, \theta_t) [R_{t+1} - \bar{R} + \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w)] \vert S_t = s ] \\
= \mathbb{E}_{\pi}[ \nabla \ln \pi(A_t \vert S_t, \theta_t) [ R_{t+1} - \bar{R} + \hat{v}(S_{t+1}, w)] \vert S_t = s] - \mathbb{E}_{\pi} [ \nabla \ln \pi (A_t \vert S_t, \theta_t) \hat{v}(S_t, w) \vert S_t = s] $

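The baseline term $\mathbb{E}_{\pi}[ \nabla \ln \pi(A_t \vert S_t, \theta_t) \hat{v}(S_t, w) \vert S_t = s]$ is 0, because $\hat{v}(S_t, w)$ does not depend on the action and $\sum_a \nabla \pi(a \vert s, \theta) = \nabla 1 = 0$. Subtracting it therefore leaves the expected update unchanged, but it can reduce the variance of the update, which results in faster learning.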
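A small exact calculation can illustrate both claims for one state. The sketch assumes a tabular softmax policy and made-up action values, uses the state value $\sum_a \pi(a) q(a)$ as the baseline, and sums over all actions rather than sampling, so there is no randomness involved. The helper names are hypothetical.

```python
import math

def softmax(prefs):
    largest = max(prefs)
    exps = [math.exp(h - largest) for h in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_pi(probs, a):
    # Tabular softmax: d ln pi(a) / d theta_b = 1[a == b] - pi(b).
    return [(1.0 if b == a else 0.0) - probs[b] for b in range(len(probs))]

probs = softmax([0.5, -0.2, 0.1])   # hypothetical preferences
q = [10.0, 11.0, 9.5]               # hypothetical action values
v = sum(p * qa for p, qa in zip(probs, q))   # baseline: state value under pi
n = len(probs)

# E_pi[ grad ln pi(A) * v ]: the baseline term, which should vanish.
baseline_term = [sum(probs[a] * grad_log_pi(probs, a)[b] * v for a in range(n))
                 for b in range(n)]

def mean_and_variance(baseline):
    """Mean and total variance of grad ln pi(A) * (q(A) - baseline), A ~ pi."""
    mean = [sum(probs[a] * grad_log_pi(probs, a)[b] * (q[a] - baseline)
                for a in range(n)) for b in range(n)]
    second_moment = sum(
        probs[a] * sum(g * g for g in grad_log_pi(probs, a)) * (q[a] - baseline) ** 2
        for a in range(n))
    return mean, second_moment - sum(m * m for m in mean)

mean_plain, var_plain = mean_and_variance(0.0)
mean_base, var_base = mean_and_variance(v)
print(baseline_term, var_plain, var_base)
```

With these numbers the mean update is identical with and without the baseline, while the variance is much smaller with it. A baseline is not guaranteed to lower the variance in every setting. The state value simply tends to be a good choice.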

### How the actor and the critic interact

![](image/actor_critic2.png)

$ \theta_{t+1} \doteq \theta_t + \alpha \nabla \ln \pi (A_t \vert S_t, \theta_t) \delta_t $

### Actor-Critic (continuing), for estimating $\pi_{\theta} \approx \pi_{*}$

$\begin{aligned}
&\text{Input: a differentiable policy parameterization } \pi(a \vert s, \theta) \\
&\text{Input: a differentiable state-value function parameterization } \hat{v}(s, w) \\
&\text{Initialize } \bar{R} \in \mathbb{R} \text{ to } 0 \\
&\text{Initialize state-value weights } w \in \mathbb{R}^d \text{ and policy parameter } \theta \in \mathbb{R}^{d'} \text{ (e.g., to } 0 ) \\
&\text{Algorithm parameters: } \alpha^w > 0, \alpha^{\theta} > 0, \alpha^{\bar{R}} > 0 \\
&\text{Initialize } S \in \mathcal{S} \\
&\text{Loop forever (for each time step):} \\
&\quad A \sim \pi(\cdot \vert S, \theta) \\
&\quad \text{Take action } A, \text{ observe } S', R \\
&\quad \delta \leftarrow R - \bar{R} + \hat{v}(S', w) - \hat{v}(S, w) \\
&\quad \bar{R} \leftarrow \bar{R} + \alpha^{\bar{R}} \delta \\
&\quad w \leftarrow w + \alpha^w \delta \nabla \hat{v}(S, w) \\
&\quad \theta \leftarrow \theta + \alpha^{\theta} \delta \nabla \ln \pi(A \vert S, \theta) \\
&\quad S \leftarrow S'
\end{aligned}$
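The pseudocode above can be sketched in plain Python as follows. This is only an illustration under strong assumptions: the environment is a made-up two-state continuing task whose transitions ignore the action (effectively a contextual bandit), both actor and critic are tabular, and the step sizes are arbitrary. The names `env_step` and `actor_critic` are hypothetical.

```python
import math
import random

def softmax(prefs):
    largest = max(prefs)
    exps = [math.exp(h - largest) for h in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def env_step(state, action, rng):
    """Hypothetical toy task: reward 1 when the action index matches the state."""
    reward = 1.0 if action == state else 0.0
    next_state = rng.randrange(2)   # transitions ignore the action
    return next_state, reward

def actor_critic(num_steps=20000, alpha_w=0.1, alpha_theta=0.1,
                 alpha_r=0.01, seed=0):
    rng = random.Random(seed)
    n_states, n_actions = 2, 2
    w = [0.0] * n_states                                  # critic: v_hat(s) = w[s]
    theta = [[0.0] * n_actions for _ in range(n_states)]  # actor: h(s, a) = theta[s][a]
    avg_reward = 0.0                                      # R-bar
    state = rng.randrange(n_states)
    for _ in range(num_steps):
        probs = softmax(theta[state])
        action = rng.choices(range(n_actions), weights=probs)[0]
        next_state, reward = env_step(state, action, rng)
        # TD error for the average-reward setting.
        delta = reward - avg_reward + w[next_state] - w[state]
        avg_reward += alpha_r * delta
        # The gradient of a tabular v_hat is one-hot, so only w[state] moves.
        w[state] += alpha_w * delta
        # Tabular softmax: grad ln pi(A | S) = one_hot(A) - pi(. | S).
        for b in range(n_actions):
            indicator = 1.0 if b == action else 0.0
            theta[state][b] += alpha_theta * delta * (indicator - probs[b])
        state = next_state
    return theta, w, avg_reward

theta, w, avg_reward = actor_critic()
policy = [softmax(row) for row in theta]
print(policy, avg_reward)
```

On this toy task the learned policy should put most of its probability on the matching action in each state, and the average-reward estimate should approach the best achievable value of 1. A task where actions influence the next state would exercise the critic more, at the cost of a longer example.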

## Actor-Critic with Softmax Policies

### Recap - Actor-Critic

$ w \leftarrow w + \alpha^w \delta \nabla \hat{v}(S, w) $

$ \theta \leftarrow \theta + \alpha^{\theta} \delta \nabla \ln \pi(A \vert S, \theta) $

### Policy update with a softmax policy

$ \theta \leftarrow \theta + \alpha^{\theta} \delta \nabla \ln \pi(A \vert S, \theta) $

$ \pi(a \vert s, \theta) \doteq \frac{e^{h(s, a, \theta)}}{\sum_{b \in \mathcal{A}} e^{h(s, b, \theta)}} $

### Features of the Action preference function

$ \hat{v}(s, w) \doteq w^T x(s) $

$ h(s, a, \theta) \doteq \theta^T x_h(s, a) $
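One common way to obtain state-action features $x_h(s, a)$ from state features $x(s)$ is to stack them: keep one copy of the state features per action and zero out every copy except the one for the chosen action. The notes above do not spell out how $x_h$ is built, so treat the following as an assumed construction. `stacked_features` is a hypothetical helper.

```python
def stacked_features(state_features, action, num_actions):
    """Place x(s) in the block belonging to `action`; all other blocks stay zero."""
    d = len(state_features)
    x = [0.0] * (d * num_actions)
    x[action * d:(action + 1) * d] = state_features
    return x

# Hypothetical state features and three actions.
x_s = [1.0, 2.0]
print(stacked_features(x_s, 1, 3))
```

With this layout, $\theta$ has one block of weights per action, and $\theta^T x_h(s, a)$ reads out only the block for action $a$.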

### Actor-Critic with a softmax policy

Critic's update:

$ w \leftarrow w + \alpha^w \delta \nabla \hat{v}(S, w) $

The gradient of a linear value function is just the feature vector:

$ \nabla \hat{v}(s, w) = x(s) $

So the critic's weight update becomes:

$ w \leftarrow w + \alpha^w \delta x(S) $

Actor's update:

$ \theta \leftarrow \theta + \alpha^{\theta} \delta \nabla \ln \pi(A \vert S, \theta) $

$ \nabla \ln \pi(a \vert s, \theta) = x_h(s, a) - \sum\limits_b \pi(b \vert s, \theta) x_h(s, b) $
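The closed-form gradient can be sanity-checked against finite differences. The sketch below does this for a single state with three actions. The feature vectors and parameters are made up, and the helper names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax_probs(theta, feats):
    """pi(. | s, theta) with linear preferences h(s, a, theta) = theta^T x_h(s, a)."""
    prefs = [dot(theta, x) for x in feats]
    largest = max(prefs)
    exps = [math.exp(h - largest) for h in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_pi(theta, feats, a):
    """x_h(s, a) - sum_b pi(b | s, theta) x_h(s, b)."""
    probs = softmax_probs(theta, feats)
    expected = [sum(probs[b] * feats[b][i] for b in range(len(feats)))
                for i in range(len(theta))]
    return [feats[a][i] - expected[i] for i in range(len(theta))]

# Hypothetical features x_h(s, a) for one state and three actions.
feats = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5], [1.0, 1.0, 0.0]]
theta = [0.2, -0.1, 0.4]
a = 1
analytic = grad_log_pi(theta, feats, a)

# Central finite differences of ln pi(a | s, theta), one coordinate at a time.
eps = 1e-6
numeric = []
for i in range(len(theta)):
    plus, minus = list(theta), list(theta)
    plus[i] += eps
    minus[i] -= eps
    numeric.append((math.log(softmax_probs(plus, feats)[a])
                    - math.log(softmax_probs(minus, feats)[a])) / (2 * eps))

print(analytic)
print(numeric)
```

The two vectors should match to several decimal places. In the actor update, `grad_log_pi` is the vector that gets scaled by $\alpha^{\theta} \delta$.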

## Gaussian Policies for Continuous Actions


