## Policy Gradient

### Policy Objective Function
- Goal: given policy $\pi_\theta(s,a)$ with parameters $\theta$, find the best $\theta$
- How do we measure the quality of a policy $\pi_\theta$?
  - In episodic environments we can use the start value
$$
J_1(\theta) = V^{\pi_\theta}(s_1) = \mathbb{E}_{\pi_\theta}[v_1]
$$
    - The value of the first state
    - The start state is either a single state or drawn from a fixed distribution
  - In continuing environments we can use the average value
$$
J_{av}(\theta) = \sum_s d^{\pi_\theta}(s) V^{\pi_\theta}(s)
$$
    - The sum over states of the probability of being in the state times the value of that state
  - Or the average reward per time-step
$$
J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s,a) R^a_s
$$
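    - The sum over states of the probability of being in the state times the expected immediate reward there, i.e. the sum over actions of the action probability (the policy) times the reward
  - where $d^{\pi_\theta}(s)$ is the stationary distribution of the Markov chain induced by $\pi_\theta$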
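
As a rough illustration of the average-reward objective, the sketch below computes $d^{\pi_\theta}(s)$ and $J_{avR}$ for a tiny MDP. The 2-state, 2-action MDP, its transition and reward numbers, and the fixed policy table are all made up for this example; they are not part of the notes.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; every number here is illustrative only.
P = np.array([              # P[a, s, t] = probability of moving s -> t under action a
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.6, 0.4]],
])
R = np.array([[1.0, 0.0],   # R[s, a] = expected immediate reward
              [0.0, 2.0]])
pi = np.array([[0.7, 0.3],  # pi[s, a] = probability of action a in state s
               [0.4, 0.6]])

# State-to-state transitions under the policy: P_pi[s, t] = sum_a pi[s, a] * P[a, s, t].
P_pi = np.einsum("sa,ast->st", pi, P)

# Stationary distribution d (satisfies d @ P_pi = d), found by power iteration.
d = np.full(2, 0.5)
for _ in range(1000):
    d = d @ P_pi

# Average reward per time-step: sum_s d(s) sum_a pi(s, a) R[s, a].
J_avR = np.sum(d[:, None] * pi * R)
print(d, J_avR)
```

Power iteration is used only because it is short; solving the linear system for the stationary distribution would work equally well.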

### Policy Optimisation
- Find $\theta$ that maximises $J(\theta)$

### Policy Gradient
$$
\bigtriangleup \theta = \alpha \bigtriangledown_\theta J(\theta)
$$

## Monte-Carlo Policy Gradient

### Score Function
- Assume policy $\pi_\theta$ is differentiable whenever it is non-zero
- and we know the gradient $\bigtriangledown_\theta \pi_\theta(s,a)$
- Likelihood ratios exploit the following identity
$$
\begin{align}
\bigtriangledown_\theta \pi_\theta(s,a) & = \pi_\theta(s,a) \frac{\bigtriangledown_\theta \pi_\theta(s,a)}{\pi_\theta(s,a)} \\
& = \pi_\theta(s,a) \bigtriangledown_\theta \log \pi_\theta(s,a)
\end{align}
$$
- The score function is $\bigtriangledown_\theta \log \pi_\theta(s,a)$

### One-Step MDP
- Consider a simple class of one-step MDPs
  - Starting in state $s \sim d(s)$
  - Terminating after one time-step with reward $r=R_{s,a}$
- Use likelihood ratios to compute the policy gradient
$$
\begin{align}
J(\theta) & = \mathbb{E}_{\pi_\theta}[r] \\
 & = \sum_{s \in S} d(s) \sum_{a \in A} \pi_\theta(s,a)R_{s,a} \\
\bigtriangledown_\theta J(\theta) & = \sum_{s \in S} d(s) \sum_{a \in A} \pi_\theta (s,a) \bigtriangledown_\theta \log \pi_\theta (s,a) R_{s,a} \\
& = \mathbb{E}_{\pi_\theta}[\bigtriangledown_\theta \log \pi_\theta (s,a) r]
\end{align}
$$
- Because the gradient is an expectation, it can be approximated by a sample average.
- We only need to choose $\pi_\theta (s,a)$ to be a differentiable function.
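
The sample-average idea can be sketched numerically. The example below assumes a tabular softmax policy with one logit per state-action pair, a made-up reward table `R`, and a made-up start distribution `d`; it compares the exact gradient of $J(\theta)$ with the average of $\bigtriangledown_\theta \log \pi_\theta(s,a)\, r$ over sampled $(s,a)$ pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, N = 2, 3, 200_000                      # illustrative sizes and sample count

d = np.array([0.5, 0.5])                     # start-state distribution (made up)
R = np.array([[1.0, 0.0, 2.0],               # R[s, a], reward of the one-step MDP (made up)
              [0.0, 3.0, 1.0]])
theta = rng.normal(size=(S, A))              # tabular softmax logits (an assumption)

pi = np.exp(theta - theta.max(axis=1, keepdims=True))
pi /= pi.sum(axis=1, keepdims=True)

# Exact gradient of J = sum_s d(s) sum_a pi(s,a) R[s,a] with respect to the logits.
exact = d[:, None] * pi * (R - (pi * R).sum(axis=1, keepdims=True))

# Sample s ~ d and a ~ pi(s, .) using the inverse CDF.
states = rng.choice(S, size=N, p=d)
u = rng.random(N)
cdf = pi.cumsum(axis=1)[states]
actions = np.minimum((u[:, None] > cdf).sum(axis=1), A - 1)
rewards = R[states, actions]

# Score of a tabular softmax: one-hot(a) - pi(s, .), non-zero only in row s.
scores = np.eye(A)[actions] - pi[states]
estimate = np.zeros((S, A))
np.add.at(estimate, states, scores * rewards[:, None])
estimate /= N

print(np.abs(estimate - exact).max())
```

The estimate uses only sampled rewards and the score function; it never differentiates the reward table, which is what makes the likelihood-ratio form usable when the reward function is unknown.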

### Softmax Policy
- We will use a softmax policy as a running example
- Weight actions using linear combination of features $\phi(s,a)^T \theta$
- Probability of action is proportional to exponentiated weight
$$
\pi_\theta (s,a) \propto e^{\phi(s,a)^T \theta}
$$
- The score function is
$$
\bigtriangledown_\theta \log \pi_\theta(s,a) = \phi(s,a) - \mathbb{E}_{\pi_\theta}[\phi(s,\cdot)]
$$
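
A minimal sketch of the softmax score function, checked against a finite-difference gradient of $\log \pi_\theta(s,a)$. The feature matrix `phi_s` (one row per action), the parameter vector, and the helper names are hypothetical, chosen only for this check.

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """Return pi(s, .) for features phi_s of shape (num_actions, num_features)."""
    logits = phi_s @ theta
    p = np.exp(logits - logits.max())    # subtract the max for numerical stability
    return p / p.sum()

def softmax_score(theta, phi_s, a):
    """Score function: phi(s, a) - E_pi[phi(s, .)]."""
    p = softmax_policy(theta, phi_s)
    return phi_s[a] - p @ phi_s

rng = np.random.default_rng(0)
phi_s = rng.normal(size=(3, 4))          # random illustrative features
theta = rng.normal(size=4)
a = 1

analytic = softmax_score(theta, phi_s, a)

# Central finite differences of log pi(s, a) with respect to each parameter.
eps = 1e-6
numeric = np.array([
    (np.log(softmax_policy(theta + eps * e, phi_s)[a])
     - np.log(softmax_policy(theta - eps * e, phi_s)[a])) / (2 * eps)
    for e in np.eye(theta.size)
])
print(np.abs(analytic - numeric).max())
```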

### Gaussian Policy
- In continuous action spaces, a Gaussian policy is natural
- Mean is a linear combination of state features $\mu(s) = \phi(s)^T \theta$
- Variance may be fixed $\sigma^2$, or can be parametrised
- Policy is Gaussian, $a \sim N(\mu(s), \sigma^2)$
- The score function is
$$
\bigtriangledown_\theta \log \pi_\theta(s,a) = \frac{(a-\mu(s))\phi(s)}{\sigma^2}
$$
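
The Gaussian score function can be checked the same way. This sketch assumes a fixed $\sigma$ and made-up values for the state features, the parameters, and the action; the function names are hypothetical.

```python
import numpy as np

def gaussian_log_prob(theta, phi_s, a, sigma):
    """log N(a; mu(s), sigma^2) with mu(s) = phi(s)^T theta."""
    mu = phi_s @ theta
    return -0.5 * np.log(2 * np.pi * sigma**2) - (a - mu) ** 2 / (2 * sigma**2)

def gaussian_score(theta, phi_s, a, sigma):
    """Score function: (a - mu(s)) * phi(s) / sigma^2."""
    mu = phi_s @ theta
    return (a - mu) * phi_s / sigma**2

phi_s = np.array([1.0, -0.5, 2.0])       # illustrative state features
theta = np.array([0.3, 0.1, -0.2])
a, sigma = 0.7, 0.5

analytic = gaussian_score(theta, phi_s, a, sigma)

# Central finite differences of the log-density with respect to theta.
eps = 1e-5
numeric = np.array([
    (gaussian_log_prob(theta + eps * e, phi_s, a, sigma)
     - gaussian_log_prob(theta - eps * e, phi_s, a, sigma)) / (2 * eps)
    for e in np.eye(theta.size)
])
print(analytic, numeric)
```

The score points along $\phi(s)$ and grows with the distance between the sampled action and the mean, so actions far from the mean move $\theta$ the most.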

### Policy Gradient Theorem
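- Generalises the one-step result to multi-step MDPs: the instantaneous reward $r$ is replaced by the long-term value $Q^{\pi_\theta}(s,a)$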
$$
\bigtriangledown_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\bigtriangledown_\theta \log \pi_\theta (s,a) Q^{\pi_\theta}(s,a)]
$$

### Monte-Carlo Policy Gradient (REINFORCE)
- Update parameters by stochastic gradient ascent
- Using policy gradient theorem
- Using return $v_t$ as an unbiased sample of $Q^{\pi_\theta}(s_t,a_t)$
$$
\bigtriangleup \theta_t = \alpha \bigtriangledown_\theta \log \pi_\theta(s_t,a_t) v_t
$$


$$
\begin{align}
& \text{function REINFORCE} \\
& \hspace{10mm} \text{Initialise }\theta\text{ arbitrarily} \\
& \hspace{10mm} \text{for each episode }\{s_1, a_1, r_2, \dots, s_{T-1}, a_{T-1}, r_T\} \sim \pi_\theta\text{ do}\\
& \hspace{20mm} \text{for }t = 1\text{ to }T - 1\text{ do}\\
& \hspace{30mm} \theta \leftarrow \theta + \alpha \bigtriangledown_\theta \log \pi_\theta(s_t,a_t)v_t\\
& \hspace{20mm} \text{end for }\\
& \hspace{10mm} \text{end for }\\
& \hspace{10mm} \text{return }\theta\\
& \text{end function}
\end{align}
$$
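
The pseudocode above could be sketched in Python as follows. Everything about the environment is an assumption made for illustration: a hypothetical three-state corridor with the goal to the right, a reward of -1 per step, an undiscounted return, a tabular softmax policy, and arbitrary step-size and episode-count choices.

```python
import numpy as np

N_STATES, GOAL, MAX_STEPS = 3, 3, 50     # hypothetical corridor: states 0..2, goal at 3
LEFT, RIGHT = 0, 1

def policy(theta, s):
    """Tabular softmax policy pi(s, .) from logits theta[s]."""
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

def run_episode(theta, rng):
    """Sample one episode from the start state 0; reward is -1 per step."""
    s, states, actions, rewards = 0, [], [], []
    for _ in range(MAX_STEPS):
        a = RIGHT if rng.random() < policy(theta, s)[RIGHT] else LEFT
        states.append(s)
        actions.append(a)
        rewards.append(-1.0)
        s = max(s - 1, 0) if a == LEFT else s + 1
        if s == GOAL:
            break
    return states, actions, rewards

def reinforce(num_episodes=3000, alpha=0.005, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((N_STATES, 2))
    for _ in range(num_episodes):
        states, actions, rewards = run_episode(theta, rng)
        returns = np.cumsum(rewards[::-1])[::-1]   # v_t = sum of rewards from step t onward
        for s, a, v in zip(states, actions, returns):
            score = -policy(theta, s)              # grad of log pi(s, a) w.r.t. theta[s]
            score[a] += 1.0
            theta[s] += alpha * score * v          # theta <- theta + alpha * score * v_t
    return theta

theta = reinforce()
p_right = np.array([policy(theta, s)[RIGHT] for s in range(N_STATES)])
print(p_right)
```

Because every return is negative here, each sampled action is pushed down, and the actions that led to longer episodes are pushed down hardest. No baseline is used, so the updates are noisy.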

## Actor-Critic Policy Gradient

### Reducing Variance Using a Critic
- We use a critic to estimate the action-value function
$$
Q_w(s,a) \approx Q^{\pi_\theta}(s,a)
$$
- Actor-critic algorithms maintain two sets of parameters
  - Critic: Updates action-value function parameters $w$
  - Actor: Updates policy parameters $\theta$, in direction suggested by critic
- Actor-critic algorithms follow an approximate policy gradient
$$
\begin{align}
\bigtriangledown_\theta J(\theta) & \approx \mathbb{E}_{\pi_\theta}[\bigtriangledown_\theta \log \pi_\theta (s,a) Q_w(s,a)] \\
\bigtriangleup \theta & = \alpha \bigtriangledown_\theta \log \pi_\theta(s,a) Q_w(s,a)
\end{align}
$$

### Action-Value Actor-Critic
- Simple actor-critic algorithm based on action-value critic
- Using linear value fn approx. $Q_w(s,a) = \phi(s,a)^T w$
  - Critic: Updates $w$ by linear TD(0)
  - Actor: Updates $\theta$ by policy gradient
$$
\begin{align}
& \text{function QAC}\\
& \hspace{10mm}\text{Initialise s, }\theta \\
& \hspace{10mm}\text{Sample }a \sim \pi_\theta \\
& \hspace{10mm}\text{for each step do} \\
& \hspace{20mm}\text{Sample reward }r=R^a_s\text{; sample transition }s^\prime \sim P^a_{s,\cdot} \\
& \hspace{20mm}\text{Sample action }a^\prime \sim \pi_\theta(s^\prime, a^\prime) \\
& \hspace{20mm}\delta = r + \gamma Q_w(s^\prime, a^\prime) - Q_w(s, a) \\
& \hspace{20mm}\theta \leftarrow \theta + \alpha \bigtriangledown_\theta \log \pi_\theta(s,a)Q_w(s,a) \\
& \hspace{20mm}w \leftarrow w + \beta \delta \phi(s,a) \\
& \hspace{20mm}a \leftarrow a^\prime, s \leftarrow s^\prime \\
& \hspace{10mm}\text{end for} \\
& \text{end function}
\end{align}
$$
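
One iteration of the QAC loop body can be written as a small function. This is a sketch under assumptions: the function name `qac_step`, the convention of passing the score $\bigtriangledown_\theta \log \pi_\theta(s,a)$ in as an argument, and all numeric values are hypothetical. The update order follows the pseudocode, so the actor uses $Q_w(s,a)$ computed before $w$ changes.

```python
import numpy as np

def qac_step(theta, w, phi_sa, phi_next, score, r, alpha=0.1, beta=0.1, gamma=0.9):
    """One actor-critic update with a linear critic Q_w(s, a) = phi(s, a)^T w."""
    q_sa = phi_sa @ w                      # Q_w(s, a)
    q_next = phi_next @ w                  # Q_w(s', a')
    delta = r + gamma * q_next - q_sa      # TD error
    theta = theta + alpha * score * q_sa   # actor: policy-gradient step
    w = w + beta * delta * phi_sa          # critic: linear TD(0) step
    return theta, w, delta

# Illustrative numbers only.
theta = np.zeros(2)
w = np.array([1.0, 2.0])
phi_sa = np.array([1.0, 0.0])              # features of (s, a)
phi_next = np.array([0.0, 1.0])            # features of (s', a')
score = np.array([0.5, -0.5])              # score of the sampled action

theta, w, delta = qac_step(theta, w, phi_sa, phi_next, score, r=0.5)
print(theta, w, delta)
```

With these numbers $Q_w(s,a) = 1$ and $Q_w(s',a') = 2$, so $\delta = 0.5 + 0.9 \cdot 2 - 1 = 1.3$.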

### Reducing Variance Using a Baseline
- We subtract a baseline function $B(s)$ from the policy gradient
- This can reduce variance, without changing expectation
$$
\begin{align}
\mathbb{E}_{\pi_\theta}[\bigtriangledown_\theta \log \pi_\theta(s,a) B(s)] & = \sum_{s \in S} d^{\pi_\theta}(s) \sum_a \bigtriangledown_\theta \pi_\theta(s,a) B(s) \\
& = \sum_{s \in S} d^{\pi_\theta}(s)  B(s) \bigtriangledown_\theta \sum_a \pi_\theta(s,a) \\
& = 0
\end{align}
$$
- A good baseline is the state value function $B(s) = V^{\pi_\theta}(s)$
- So we can rewrite the policy gradient using advantage function $A^{\pi_\theta}(s,a)$
$$
\begin{align}
A^{\pi_\theta}(s,a) & = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s) \\
\bigtriangledown_\theta J(\theta) & = \mathbb{E}_{\pi_\theta}[\bigtriangledown_\theta \log \pi_\theta(s,a) A^{\pi_\theta}(s,a)]
\end{align}
$$
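
Both claims about the baseline can be checked exactly on a small case. The sketch below assumes a single-state problem with a tabular softmax policy and made-up logits and rewards. It enumerates the actions to compute the mean and total variance of the gradient sample $\bigtriangledown_\theta \log \pi_\theta(s,a)\,(R_{s,a} - B(s))$, once with $B = 0$ and once with $B = V^{\pi_\theta}(s)$.

```python
import numpy as np

theta = np.array([0.2, -0.1, 0.4])        # logits of a single-state softmax policy (made up)
R = np.array([10.0, 11.0, 12.0])          # reward of each action (made up)

pi = np.exp(theta - theta.max())
pi /= pi.sum()
scores = np.eye(3) - pi                   # row a is grad_theta log pi(a)

def grad_mean_and_variance(baseline):
    """Exact mean and total variance of score * (R - baseline) under a ~ pi."""
    samples = scores * (R - baseline)[:, None]    # one gradient sample per action
    mean = pi @ samples
    second_moment = pi @ (samples ** 2).sum(axis=1)
    return mean, second_moment - mean @ mean

V = pi @ R                                # state value used as the baseline
mean_plain, var_plain = grad_mean_and_variance(0.0)
mean_base, var_base = grad_mean_and_variance(V)
print(var_plain, var_base)
```

The two means agree, while the variance is much smaller with the baseline in this example because the rewards sit far from zero. A badly chosen baseline can also increase variance, so the reduction is a property of this setup, not a guarantee.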

### Estimating the Advantage Function
- Using two function approximators and two parameter vectors
$$
\begin{align}
V_v(s) & \approx V^{\pi_\theta}(s) \\
Q_w(s,a) & \approx Q^{\pi_\theta}(s,a) \\
A(s, a) & = Q_w(s,a) - V_v(s)
\end{align}
$$
- The critic then needs two parameter vectors, $v$ and $w$. However, the TD error makes one of them unnecessary:
- For the true value function $V^{\pi_\theta}(s)$, the TD error $\delta^{\pi_\theta}$
$$
\delta^{\pi_\theta} = r + \gamma V^{\pi_\theta}(s^\prime) - V^{\pi_\theta}(s)
$$
- is an unbiased estimate of the advantage function
$$
\begin{align}
\mathbb{E}_{\pi_\theta}[\delta^{\pi_\theta}|s,a] & = \mathbb{E}_{\pi_\theta}[r + \gamma V^{\pi_\theta}(s^\prime)|s,a] - V^{\pi_\theta}(s) \\
& = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s) \\
& = A^{\pi_\theta}(s,a)
\end{align}
$$
- So we can use the TD error to compute the policy gradient
$$
\bigtriangledown_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\bigtriangledown_\theta \log \pi_\theta(s,a) \delta^{\pi_\theta}]
$$
- In practice we can use an approximate TD error
$$
\delta_v = r + \gamma V_v(s^\prime)-V_v(s)
$$
- This approach only requires one set of critic parameters $v$
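
A single update using the approximate TD error might look like the sketch below. It assumes a linear state-value critic $V_v(s) = \phi(s)^T v$ trained by linear TD(0), which the notes do not specify for this variant; the function name, the score passed in as an argument, and all numbers are hypothetical.

```python
import numpy as np

def td_actor_critic_step(theta, v, phi_s, phi_next, score, r,
                         alpha=0.1, beta=0.1, gamma=0.9):
    """One update where the TD error stands in for the advantage."""
    delta = r + gamma * (phi_next @ v) - (phi_s @ v)   # delta_v
    theta = theta + alpha * score * delta              # actor step along score * delta_v
    v = v + beta * delta * phi_s                       # critic: linear TD(0) (an assumption)
    return theta, v, delta

# Illustrative numbers only.
theta = np.zeros(2)
v = np.array([0.5, 1.0])
phi_s = np.array([1.0, 0.0])               # features of s
phi_next = np.array([0.0, 1.0])            # features of s'
score = np.array([0.5, -0.5])              # score of the sampled action

theta, v, delta = td_actor_critic_step(theta, v, phi_s, phi_next, score, r=1.0)
print(theta, v, delta)
```

Here $\delta_v = 1 + 0.9 \cdot 1.0 - 0.5 = 1.4$, and the same scalar drives both the actor and the critic update.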