# Intro to Policy Optimization

Reference: 
+ https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html


## I. Background 
From the [Problem Formulation](docs/ProblemFormulation_Notation.ipynb), *the goal of RL* is to select an optimal policy $\pi^*$ that **maximizes the expected return $J(\pi)$** when the agent acts according to it:
    $$\pi^*= \argmax_\pi J(\pi) \quad \text{where} \quad J(\pi)= \mathbb{E}_{\tau \sim \pi} [R(\tau)] = \int_\tau P(\tau|\pi) R(\tau) \quad\quad (1)$$

where:
+ $R(\tau) = \sum_{t=0}^T \gamma^t r_t$ is the discounted return accumulated along the trajectory.
+ $P(\tau|\pi)$ is the probability of a T-step trajectory $\tau=(s_0,a_0,s_1,a_1,...)$ and is derived as:
    $$P(\tau|\pi) = \rho_0(s_0) \prod_{t=0}^{T-1}P(s_{t+1}|s_t,a_t)\pi_\theta(a_t|s_t) \quad\quad (2)$$
  but the state transition $P(s_{t+1}|s_t,a_t)$ of the environment is **unknown**. 
+ $\pi_\theta(a_t|s_t)$ is the parameterized policy, whose parameters $\theta$ we optimize by stochastic gradient ascent on $J$ with step size $\alpha$:
    $$\theta \leftarrow \theta + \alpha \nabla_\theta J(\pi_\theta)$$

## II. Solution 
To perform this update, we need to compute $\nabla_\theta J(\pi_\theta)$.
By definition, and expanding the expectation into an integral, we have:
$$\nabla_\theta J(\pi_\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta} [R(\tau)] = \nabla_\theta \int_\tau P(\tau|\pi_\theta) R(\tau)\quad\quad (3)$$
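+ We can bring the gradient $\nabla_\theta$ inside the integral $\int_\tau$ because the integration runs over trajectories $\tau$, not over the parameters $\theta$ (assuming the usual regularity conditions for exchanging differentiation and integration):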
    $$\nabla_\theta J(\pi_\theta) =  \int_\tau \nabla_\theta P(\tau|\pi_\theta) R(\tau) \quad\quad (4) $$
+ Notice that $\nabla_\theta P(\tau|\pi_\theta)$ is hard to compute directly, but $\nabla_\theta \log(P(\tau|\pi_\theta))$ is much easier: the logarithm turns the product in (2) into a sum, and the terms $\rho_0(s_0)$ and $P(s_{t+1}|s_t,a_t)$ do not depend on $\theta$, so their gradients vanish. From (2), we have:
    $$\nabla_\theta \log(P(\tau|\pi_\theta)) = \nabla_\theta \left[\log \rho_0(s_0) + \sum_{t=0}^{T-1}\log P(s_{t+1}|s_t,a_t) + \sum_{t=0}^{T-1}\log\pi_\theta(a_t|s_t)\right]
                                             = \sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t|s_t) \quad\quad (5)$$
+ From the derivative of the log function (the log-derivative trick), we also have:
    $$\nabla_\theta \log(P(\tau|\pi_\theta)) = \frac{\nabla_\theta P(\tau|\pi_\theta)}{P(\tau|\pi_\theta)} \Rightarrow \nabla_\theta P(\tau|\pi_\theta)=P(\tau|\pi_\theta) \nabla_\theta \log(P(\tau|\pi_\theta)) \quad\quad (5')$$ 
+ Then, from (4), (5), and (5'), we have:
    $$ \nabla_\theta J(\pi_\theta) =  \int_\tau P(\tau|\pi_\theta) \sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau) \right]\quad\quad (6)$$
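The identity in (6) can be checked numerically on a toy problem. The sketch below is not part of the original derivation: it assumes a one-step problem (a 3-armed bandit) with a softmax policy whose logits are $\theta$, and made-up deterministic rewards. In that setting the expectation in (6) is a finite sum, so it can be compared with a finite-difference gradient of $J$. All function names are illustrative.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def expected_return(theta, rewards):
    """J(theta) = sum_a pi_theta(a) * r(a) for a one-step problem."""
    return float(softmax(theta) @ rewards)

def score_function_gradient(theta, rewards):
    """Right-hand side of (6): E_a[ grad log pi(a) * R ], computed exactly."""
    pi = softmax(theta)
    grad = np.zeros_like(theta)
    for a in range(len(theta)):
        # For a softmax policy, grad_theta log pi(a) = one_hot(a) - pi.
        grad_log_pi = -pi.copy()
        grad_log_pi[a] += 1.0
        grad += pi[a] * grad_log_pi * rewards[a]
    return grad

def finite_difference_gradient(theta, rewards, eps=1e-6):
    """Central-difference approximation of grad_theta J(theta)."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (expected_return(theta + step, rewards)
                   - expected_return(theta - step, rewards)) / (2 * eps)
    return grad

theta = np.array([0.2, -0.5, 1.0])    # arbitrary logits (assumption)
rewards = np.array([1.0, 2.0, 3.0])   # made-up rewards (assumption)

g_score = score_function_gradient(theta, rewards)
g_fd = finite_difference_gradient(theta, rewards)
print(g_score, g_fd)
```

Both gradients should agree up to the finite-difference error, even though `score_function_gradient` never differentiates the reward or any environment model.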

In summary, (6) is an expectation, so it can be estimated by a sample mean:
+ We collect a set of $N$ trajectories $\mathcal{D} = \{\tau_i\}_{i=1,...,N}$ where each trajectory is obtained by letting the agent act in the environment using the policy $\pi_{\theta}$.
+ The policy gradient in (6) can then be estimated with:
  $$ \nabla_\theta J(\pi_\theta) \approx \frac{1}{N} \sum_{\tau \in \mathcal{D}}\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau) \quad\quad (7) $$
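A minimal sketch of the sample-mean estimator in (7) follows. It assumes a tabular softmax policy with one row of logits per state, and trajectories stored as lists of `(state, action, reward)` tuples. Both are illustrative choices, not something the derivation prescribes. The two trajectories are hand-written placeholders. In practice they must be sampled by running $\pi_\theta$ in the environment.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, s, a):
    """grad_theta log pi_theta(a|s) for a tabular softmax policy."""
    grad = np.zeros_like(theta)
    grad[s] = -softmax(theta[s])   # only the row of state s is affected
    grad[s, a] += 1.0
    return grad

def discounted_return(rewards, gamma):
    """R(tau) = sum_t gamma^t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def estimate_policy_gradient(trajectories, theta, gamma=1.0):
    """Sample-mean estimate of the policy gradient, as in (7)."""
    total = np.zeros_like(theta)
    for traj in trajectories:
        R = discounted_return([r for _, _, r in traj], gamma)
        # Sum of score terms over the time steps of this trajectory.
        score = sum(grad_log_pi(theta, s, a) for s, a, _ in traj)
        total += score * R
    return total / len(trajectories)

theta = np.zeros((2, 2))   # 2 states, 2 actions, uniform initial policy
trajectories = [           # placeholder data, not real rollouts
    [(0, 1, 1.0), (1, 0, 0.0)],
    [(0, 0, 0.0), (0, 0, 2.0)],
]
g_hat = estimate_policy_gradient(trajectories, theta)
print(g_hat)
```

A gradient-ascent step would then add a multiple of `g_hat` to `theta`. With a neural-network policy, automatic differentiation usually takes the place of the hand-written `grad_log_pi`.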

## III. Modify the Loss Function
**Problem:** From (6), the gradient of the objective function is expressed as:
    $$ \nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau) \right] $$
This means that taking a step along this gradient pushes up the log-probability of each action in proportion to $R(\tau)$, the sum of all rewards obtained along the trajectory.
Here, $R(\tau)$ acts as the weight of each action. However, this formula uses the same weight for every action $a_t$, including rewards collected before the action was taken, which the action cannot have influenced. Those terms add variance to the estimate without adding information.

**Modification**: The policy gradient (6) can be generalized as:
    $$ \nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t|s_t) \Phi_t \right]\quad\quad (8)$$
where $\Phi_t$ is the modified weight of action $a_t$, treated as a constant with respect to $\theta$ when differentiating. It does not have to be positive (with a baseline it can be negative). $\Phi_t$ can be:
+ **Total return**, the same weight for every time step:
  $$\Phi_t = R(\tau) = \sum_{t'=0}^T \gamma^{t'} r_{t'}$$
+ **Reward-to-go**, which is the return collected from time $t$ onward, since the weight of an action should depend only on the rewards obtained **after performing the action**, not before it:
  $$\Phi_t = R_t = \sum_{t'=t}^{T} \gamma^{t'} r_{t'}$$
+ **Reward-to-go with a baseline**:
  $$\Phi_t = R_t - b(s_t)= \sum_{t'=t}^{T} \gamma^{t'} r_{t'} -V_\phi(s_t)$$
  where $V_\phi(s_t)$ approximates the value function $V^\pi(s_t)$. It is typically a small network with parameters $\phi$, fitted at iteration $k$ by regressing onto the reward-to-go observed under the current policy $\pi_k$:
  $$\phi_k = \argmin_\phi \mathbb{E}_{s_t,R_t \sim \pi_k}\left[(V_\phi(s_t)-R_t)^2 \right]$$
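The three choices of $\Phi_t$ above can be compared on a single short trajectory. The sketch below follows the reward-to-go formula exactly as written here, with discount $\gamma^{t'}$. Many implementations instead discount relative to the current step, $\gamma^{t'-t}$. As a stand-in for the small value network, it uses a tabular baseline, the per-state mean of the observed $R_t$, which is the exact minimizer of the squared error when $V_\phi$ has one free parameter per state. The rewards, states, and helper names are made up for illustration.

```python
import numpy as np

def rewards_to_go(rewards, gamma):
    """R_t = sum_{t' >= t} gamma^{t'} * r_{t'} for every t, via a reverse cumulative sum."""
    rewards = np.asarray(rewards, dtype=float)
    discounted = gamma ** np.arange(len(rewards)) * rewards
    return np.cumsum(discounted[::-1])[::-1]

def tabular_baseline(states, targets, n_states):
    """Per-state mean of the targets: the least-squares fit of a tabular V.

    Illustrative stand-in for the value network V_phi; unvisited states get 0.
    """
    sums = np.zeros(n_states)
    counts = np.zeros(n_states)
    np.add.at(sums, states, targets)   # accumulate targets per visited state
    np.add.at(counts, states, 1)
    return sums / np.maximum(counts, 1)

rewards = [1.0, 2.0, 3.0]        # made-up rewards r_0, r_1, r_2
states = np.array([0, 1, 0])     # made-up visited states s_0, s_1, s_2
gamma = 0.5

R_t = rewards_to_go(rewards, gamma)              # reward-to-go weights
R_tau = R_t[0]                                   # total return R(tau)
V = tabular_baseline(states, R_t, n_states=2)    # baseline b(s) = V(s)
weights = R_t - V[states]                        # reward-to-go minus baseline

print("total return:", R_tau)
print("reward-to-go:", R_t)
print("with baseline:", weights)
```

With the baseline subtracted, some weights turn negative: actions that did worse than the baseline for their state have their log-probability pushed down rather than up.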

