<h1>Policy Gradient Methods</h1>

# 0. The Policy Gradient Theorem

## Parameterized Policy

Let $\theta \in \mathbb{R}^{d'}$ be the policy's parameter vector.

$J(\theta)$ is a scalar performance measure of the policy, viewed as a function of the policy parameter.

Methods that maximize this performance by approximate gradient ascent on $J$ are called <b>policy gradient methods</b>. Their update rule for the parameter is,

> $\theta_{t+1} = \theta_{t} + \alpha \widehat{\nabla J(\theta_t)}$

$\widehat{\nabla J(\theta_t)}$ is a stochastic estimate whose expectation approximates the gradient of the performance measure with respect to its argument $\theta_t$.
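
To make the update rule concrete, here is a minimal sketch of stochastic gradient ascent on a toy objective. The objective $J(\theta) = -(\theta - 3)^2$, the Gaussian noise model, and the names `noisy_grad` and `gradient_ascent` are assumptions chosen for illustration; they are not part of the policy gradient setting itself.

```python
import random

def noisy_grad(theta, rng):
    """Unbiased but noisy estimate of dJ/dtheta for the toy J(theta) = -(theta - 3)^2."""
    true_grad = -2.0 * (theta - 3.0)
    return true_grad + rng.gauss(0.0, 1.0)  # assumed noise model: zero-mean Gaussian

def gradient_ascent(theta0=0.0, alpha=0.01, steps=5000, seed=0):
    """Apply theta <- theta + alpha * (gradient estimate) repeatedly."""
    rng = random.Random(seed)
    theta = theta0
    for _ in range(steps):
        theta = theta + alpha * noisy_grad(theta, rng)
    return theta

theta_final = gradient_ascent()
print(theta_final)  # should end up close to the maximizer theta = 3
```

Because the estimate is correct only in expectation, the iterate keeps fluctuating around the maximizer; a smaller $\alpha$ shrinks the fluctuation at the cost of slower progress.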

## Performance

The performance is defined as the value of the start state of the episode, where $s_0$ is a specific, non-random state.

> $J(\theta) = v_{\pi_{\theta}}(s_0)$

## Policy Gradient Theorem

> $\nabla J(\theta) \propto \sum_{s} \mu(s) \sum_{a}q_{\pi}(s, a)\nabla_{\theta} \pi(a|s, \theta)$

+ Proof (episodic case, undiscounted, i.e. $\gamma = 1$)

> $\nabla v_{\pi}(s) = \nabla \left[ \sum_{a} \pi(a|s) q_{\pi}(s, a) \right], \forall s \in \mathcal{S}$

> $= \sum_{a} [\nabla \pi(a|s) q_{\pi}(s, a) + \pi(a|s) \nabla q_{\pi}(s, a)] $ (product rule)

> $= \sum_{a} [\nabla \pi(a|s) q_{\pi}(s, a) + \pi(a|s) \nabla \sum_{s', r} p(s', r|s, a) (r + v_{\pi}(s'))] $

> $= \sum_{a} [\nabla \pi(a|s) q_{\pi}(s, a) + \pi(a|s) \sum_{s'} p(s'|s, a) (\nabla v_{\pi}(s'))] $

> $= \sum_{a} [\nabla \pi(a|s) q_{\pi}(s, a) + \pi(a|s) \sum_{s'} p(s'|s, a) $

>> $\sum_{a'} [\nabla \pi(a'|s') q_{\pi}(s', a') + \pi(a'|s') \sum_{s''} p(s''|s', a') (\nabla v_{\pi}(s''))] ]$

> $= \sum_{a} [\nabla \pi(a|s) q_{\pi}(s, a) + \pi(a|s) \sum_{s'} p(s'|s, a) $

>> $\sum_{a'} [\nabla \pi(a'|s') q_{\pi}(s', a') + \pi(a'|s') \sum_{s''} p(s''|s', a') $

>>> $\sum_{a''} [\nabla \pi(a''|s'') q_{\pi}(s'', a'') + \pi(a''|s'') \sum_{s'''} p(s'''|s'', a'') \nabla v_{\pi}(s''')]]]$

> $= \sum_{a}\nabla \pi(a|s)q_{\pi}(s, a) + $

>> $\sum_{a} \pi({a|s}) \sum_{s'}p(s'|s, a)  \left( \sum_{a'}\nabla \pi(a'|s')q_{\pi}(s', a') \right)+  $

>>> $\sum_{a} \pi({a|s}) \sum_{s'}p(s'|s, a)  \sum_{a'} \pi(a'|s') \sum_{s''}p(s''|s', a') \left( \sum_{a''} \nabla \pi(a''|s'')q_{\pi}(s'', a'') \right) + $

>>>> $...$

> $=\sum_{x} Pr(s \rightarrow x, 0, \pi) \sum_{a}\nabla \pi(a|x)q_{\pi}(x, a) + $

>> $\sum_{x} Pr(s \rightarrow x, 1, \pi) \sum_{a}\nabla \pi(a|x)q_{\pi}(x, a) + $

>>> $\sum_{x} Pr(s \rightarrow x, 2, \pi) \sum_{a}\nabla \pi(a|x)q_{\pi}(x, a) + $

>>>> $...$

> $= \sum_{x \in \mathcal{S}} \sum_{k=0}^{\infty}Pr(s \rightarrow x, k, \pi)\sum_{a}\nabla \pi(a|x)q_{\pi}(x, a)$

$Pr(s \rightarrow x, k, \pi)$ is the probability of transitioning from state $s$ to state $x$ in $k$ steps under policy $\pi$.

Thus, 

> $\nabla J(\theta) = \nabla v_{\pi}(s_0)$

> $=\sum_{s}\left( \sum_{k=0}^{\infty} Pr(s_0 \rightarrow s, k, \pi) \right)
\sum_{a} \nabla \pi(a|s) q_{\pi}(s, a)$

> $= \sum_s \eta(s) \sum_{a} \nabla \pi(a|s) q_{\pi}(s, a)$

$\eta(s)$ is the expected number of time steps spent in state $s$ in a single episode.

Let $h(s)$ denote the probability that an episode starts in $s$.

Time is spent in state $s$ if the episode starts in $s$, or if the agent transitions into $s$ from a preceding state $s'$,

> $\eta(s) = h(s) + \sum_{s'}\eta(s') \sum_{a} \pi(a|s')p(s|s', a), \forall s \in \mathcal{S}$

Normalizing $\eta$ so that it sums to one gives the fraction of time spent in $s$,

> $\mu(s) = \frac{\eta(s)}{\sum_{s'}\eta(s')}, \forall s \in \mathcal{S}$
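
As a worked example of the two definitions above, the sketch below solves the recursion for $\eta$ by fixed-point iteration and then normalizes it to get $\mu$. The start distribution `h` and the policy-induced transition matrix `P` (with `P[s'][s]` standing for $\sum_a \pi(a|s')p(s|s', a)$ over non-terminal states) are made-up numbers; the probability mass missing from each row is assumed to go to the terminal state.

```python
def visitation_counts(h, P, iters=1000):
    """Iterate eta(s) = h(s) + sum_{s'} eta(s') * P[s'][s] until it settles."""
    n = len(h)
    eta = list(h)
    for _ in range(iters):
        eta = [h[s] + sum(eta[sp] * P[sp][s] for sp in range(n)) for s in range(n)]
    return eta

# Hypothetical two-state episodic problem: every episode starts in state 0.
h = [1.0, 0.0]
P = [[0.5, 0.3],   # from state 0: stay 0.5, go to state 1 0.3, terminate 0.2
     [0.2, 0.4]]   # from state 1: go to state 0 0.2, stay 0.4, terminate 0.4

eta = visitation_counts(h, P)
total = sum(eta)
mu = [e / total for e in eta]
print(eta, mu)
```

Solving the two linear equations by hand gives $\eta = (2.5, 1.25)$ and therefore $\mu = (2/3, 1/3)$, which the iteration reproduces.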

> $\nabla J(\theta) = \sum_{s}\eta(s) \sum_{a} \nabla \pi(a|s) q_{\pi}(s, a)$

> $= \sum_{s} \left( \mu(s) \sum_{s'}\eta(s') \right) \sum_{a} \nabla \pi(a|s) q_{\pi}(s, a)$

> $= \left( \sum_{s'}\eta(s') \right) 
\sum_{s} \mu(s) \sum_{a} \nabla \pi(a|s) q_{\pi}(s, a)$

> $\propto \sum_{s} \mu(s) \sum_{a} \nabla \pi(a|s) q_{\pi}(s, a)$
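
The result can be checked numerically on a small problem. The sketch below compares the exact form $\sum_s \eta(s) \sum_a \nabla \pi(a|s) q_{\pi}(s, a)$ with a finite-difference gradient of $v_{\pi}(s_0)$. The MDP (`P`, `R`), the tabular softmax parameterization, and all function names are assumptions made for this illustration only.

```python
import math

# Hypothetical toy MDP: states 0 and 1 are non-terminal, index 2 is terminal.
# P[s][a] lists the probabilities of moving to state 0, state 1, and terminal.
P = [[[0.1, 0.6, 0.3], [0.3, 0.2, 0.5]],
     [[0.2, 0.1, 0.7], [0.4, 0.3, 0.3]]]
R = [[1.0, 0.0], [2.0, -1.0]]  # expected immediate reward for each (s, a)
N_S, N_A = 2, 2

def policy(theta):
    """Tabular softmax policy: pi[s][a] is proportional to exp(theta[s][a])."""
    pi = []
    for prefs in theta:
        z = [math.exp(p - max(prefs)) for p in prefs]
        pi.append([x / sum(z) for x in z])
    return pi

def q_from_v(v):
    return [[R[s][a] + sum(P[s][a][sp] * v[sp] for sp in range(N_S))
             for a in range(N_A)] for s in range(N_S)]

def evaluate(pi, iters=500):
    """Return (v, q) for the undiscounted episodic task via Bellman iteration."""
    v = [0.0] * N_S
    for _ in range(iters):
        q = q_from_v(v)
        v = [sum(pi[s][a] * q[s][a] for a in range(N_A)) for s in range(N_S)]
    return v, q_from_v(v)

def visit_counts(pi, iters=500):
    """eta(s): expected number of visits to s per episode when starting in state 0."""
    h = [1.0, 0.0]
    eta = list(h)
    for _ in range(iters):
        eta = [h[s] + sum(eta[sp] * pi[sp][a] * P[sp][a][s]
                          for sp in range(N_S) for a in range(N_A))
               for s in range(N_S)]
    return eta

def theorem_gradient(theta):
    pi = policy(theta)
    _, q = evaluate(pi)
    eta = visit_counts(pi)
    grad = [[0.0] * N_A for _ in range(N_S)]
    for s in range(N_S):
        for b in range(N_A):
            # theta[s][b] only affects pi(.|s), so only state s contributes.
            # For a softmax, d pi(a|s) / d theta[s][b] = pi(a|s) * (1[a == b] - pi(b|s)).
            grad[s][b] = eta[s] * sum(
                pi[s][a] * ((1.0 if a == b else 0.0) - pi[s][b]) * q[s][a]
                for a in range(N_A))
    return grad

def finite_difference_gradient(theta, eps=1e-5):
    grad = [[0.0] * N_A for _ in range(N_S)]
    for s in range(N_S):
        for b in range(N_A):
            plus = [row[:] for row in theta]
            minus = [row[:] for row in theta]
            plus[s][b] += eps
            minus[s][b] -= eps
            v_plus, _ = evaluate(policy(plus))
            v_minus, _ = evaluate(policy(minus))
            grad[s][b] = (v_plus[0] - v_minus[0]) / (2 * eps)
    return grad

theta = [[0.2, -0.4], [0.5, 0.1]]
g_theorem = theorem_gradient(theta)
g_numeric = finite_difference_gradient(theta)
max_gap = max(abs(g_theorem[s][b] - g_numeric[s][b])
              for s in range(N_S) for b in range(N_A))
print(max_gap)  # expected to be tiny if the two gradients agree
```

Note that `theorem_gradient` never differentiates $q_{\pi}$ or the state distribution; that is the practical appeal of the theorem.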

# 1. REINFORCE, A Monte-Carlo Policy-Gradient Method (episodic)

## Algorithm

+ Input: a differentiable policy parameterization $\pi(a|s, \theta)$

+ Initialize policy parameter $\theta \in \mathbb{R}^{d'}$

+ Repeat forever:

> Generate an episode $S_0, A_0, R_1, \dots, S_{T-1}, A_{T-1}, R_T$, following $\pi(\cdot|\cdot, \theta)$

> For each step of the episode $t = 0, 1, ..., T-1$:

>> $G \leftarrow \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$ (the return from step $t$)

>> $\theta \leftarrow \theta + \alpha \gamma^t G \nabla_{\theta} \ln \pi(A_t|S_t, \theta)$
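
The steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a reference implementation: the two-step environment in `step`, the tabular softmax policy, and the hyperparameters are all hypothetical choices made so that the example stays small.

```python
import math
import random

GAMMA = 0.9

def softmax(prefs):
    m = max(prefs)
    z = [math.exp(p - m) for p in prefs]
    total = sum(z)
    return [x / total for x in z]

def step(state, action):
    """Hypothetical two-step task. Returns (next_state, reward, done)."""
    if state == 0:
        # The first action picks a branch: state 1 (action 0) or state 2 (action 1).
        return 1 + action, 0.0, False
    # Second step: only action 1 taken in state 2 is rewarded.
    reward = 1.0 if (state == 2 and action == 1) else 0.0
    return None, reward, True

def generate_episode(theta, rng):
    """Follow pi(.|., theta) from state 0; rewards[t] holds R_{t+1}."""
    states, actions, rewards = [], [], []
    state, done = 0, False
    while not done:
        probs = softmax(theta[state])
        action = rng.choices(range(len(probs)), weights=probs)[0]
        next_state, reward, done = step(state, action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state
    return states, actions, rewards

def reinforce(num_episodes=3000, alpha=0.1, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(3)]  # one preference per (state, action)
    for _ in range(num_episodes):
        states, actions, rewards = generate_episode(theta, rng)
        # Compute the returns backwards: G_t = R_{t+1} + gamma * G_{t+1}.
        returns, g = [0.0] * len(rewards), 0.0
        for t in reversed(range(len(rewards))):
            g = rewards[t] + GAMMA * g
            returns[t] = g
        for t, (s, a) in enumerate(zip(states, actions)):
            probs = softmax(theta[s])
            for b in range(len(probs)):
                # Tabular softmax: d ln pi(a|s) / d theta[s][b] = 1[a == b] - pi(b|s).
                grad_log = (1.0 if a == b else 0.0) - probs[b]
                theta[s][b] += alpha * (GAMMA ** t) * returns[t] * grad_log
    return theta

theta = reinforce()
pi = [softmax(row) for row in theta]
print(pi)  # pi[0][1] and pi[2][1] should be close to 1
```

In this toy task only the action sequence (1, 1) earns a reward, so the learned policy should put most of its probability on action 1 in states 0 and 2. State 1 never yields a return, so its preferences stay at their initial values.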

## Derivation

> $\nabla J(\theta) \propto \sum_{s} \mu(s) \sum_{a} \nabla \pi(a|s) q_{\pi}(s, a)$

> $= \mathbb{E}_{\pi}\left[ \sum_{a} \nabla \pi(a|S_t) q_{\pi}(S_t, a) \right]$, (under $\pi$, the state $S_t$ is distributed according to $\mu$)

> $= \mathbb{E}_{\pi}\left[ \sum_{a} \pi(a|S_t) 
\frac{\nabla \pi(a|S_t)}{\pi(a|S_t) }
q_{\pi}(S_t, a) \right]$

> $= \mathbb{E}_{\pi}\left[ 
\frac{\nabla \pi(A_t|S_t)}{\pi(A_t|S_t) }
q_{\pi}(S_t, A_t) \right]$, (replace $a$ with a sample $A_t \sim \pi $)

> $= \mathbb{E}_{\pi}\left[ 
\frac{\nabla \pi(A_t|S_t)}{\pi(A_t|S_t) }
G_t \right]$, ($\mathbb{E}_{\pi}[G_t | S_t, A_t] = q_{\pi}(S_t, A_t)$)

+ Update rule

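> $\theta_{t+1} = \theta_t + \alpha \frac{\nabla \pi(A_t|S_t, \theta_t)}{\pi(A_t|S_t, \theta_t) }
G_t$

> $= \theta_t + \alpha G_t \nabla \ln \pi(A_t|S_t, \theta_t)$, 
($\nabla \ln x = \frac{\nabla x}{x}$)

This derivation takes $\gamma = 1$; the algorithm above adds the factor $\gamma^t$ for the discounted case.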
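
The key step of the derivation, replacing the sum over all actions with an expectation over $A_t \sim \pi$ weighted by $\nabla \ln \pi$, can be checked for a single state. The sketch below assumes a softmax policy over three actions with made-up preferences `theta` and action values `q`; the gradient of $\pi$ is taken numerically, while $\nabla \ln \pi$ uses the closed form for a softmax.

```python
import math

def softmax(prefs):
    m = max(prefs)
    z = [math.exp(p - m) for p in prefs]
    total = sum(z)
    return [x / total for x in z]

theta = [0.3, -0.2, 1.0]   # hypothetical action preferences for one state
q = [1.0, 2.0, -0.5]       # hypothetical action values q(s, a), held fixed
n = len(theta)
pi = softmax(theta)

def grad_pi_numeric(a, b, eps=1e-6):
    """Central-difference estimate of d pi(a|s) / d theta_b."""
    plus, minus = list(theta), list(theta)
    plus[b] += eps
    minus[b] -= eps
    return (softmax(plus)[a] - softmax(minus)[a]) / (2 * eps)

def grad_log_pi(a, b):
    """Analytic d ln pi(a|s) / d theta_b for a softmax: 1[a == b] - pi(b|s)."""
    return (1.0 if a == b else 0.0) - pi[b]

# Sum over all actions of grad pi(a|s) * q(s, a).
all_actions = [sum(grad_pi_numeric(a, b) * q[a] for a in range(n)) for b in range(n)]
# Expectation over A ~ pi of grad ln pi(A|s) * q(s, A).
expectation = [sum(pi[a] * grad_log_pi(a, b) * q[a] for a in range(n)) for b in range(n)]

max_gap = max(abs(x - y) for x, y in zip(all_actions, expectation))
print(max_gap)  # expected to be near zero
```

REINFORCE then replaces the exact expectation with a single sampled action and replaces $q_{\pi}(S_t, A_t)$ with the sampled return $G_t$, which keeps the estimate unbiased but makes it noisy.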