# Problem Formulation and Notation 

References:
+ https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#key-concepts-and-terminology
+ https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html

## I. Notation:
### 1. Policy:
A policy is a rule used by an agent to decide what actions to take:
+ It can be deterministic, denoted as: $a_t=\mu(s_t)$
+ It may be stochastic, often denoted as: $a_t \sim \pi(\cdot | s_t)$

In deep RL, a policy is parameterized by the weights $\theta$ of a neural network, so we write the policy as:
+ deterministic: $a_t=\mu_{\theta}(s_t)$
+ stochastic: $a_t \sim \pi_{\theta}(\cdot | s_t)$. 
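
The two kinds of policy can be sketched in a few lines of NumPy. This is only an illustration: the linear "network" `theta`, the state vector, and the function names `mu_theta` and `pi_theta` are assumptions made for the example, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear "network": theta maps a 3-dim state to 2 action logits.
theta = rng.normal(size=(3, 2))

def mu_theta(state):
    """Deterministic policy: always return the highest-scoring action."""
    return int(np.argmax(state @ theta))

def pi_theta(state):
    """Stochastic policy: action probabilities from a softmax over the logits."""
    logits = state @ theta
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp / exp.sum()

s_t = np.array([0.5, -1.0, 2.0])              # made-up state
a_det = mu_theta(s_t)                         # a_t = mu_theta(s_t)
probs = pi_theta(s_t)
a_sto = int(rng.choice(len(probs), p=probs))  # a_t ~ pi_theta(. | s_t)
```

Calling `mu_theta` twice on the same state always gives the same action, while repeated sampling from `pi_theta` can give different actions.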
  
### 2. Trajectory: 
A trajectory $\tau$ is a sequence of states $s_t$ and actions $a_t$ in the world, $\tau=(s_0,a_0,s_1,a_1,...)$, where:
+ the initial state $s_0$ is sampled from the start-state distribution when the environment is reset: $s_0 \sim \rho_0$.
+ the next state $s_{t+1}$ depends only on the current state and action. We assume the transition is stochastic, with *transition probability*: $s_{t+1} \sim P(\cdot|s_t,a_t)$.
+ the action $a_t$ is drawn from the stochastic policy: $a_t \sim \pi_\theta(\cdot|s_t)$.
  
Then, the probability of a $T$-step trajectory is:
    $$P(\tau|\pi) = \rho_0(s_0) \prod_{t=0}^{T-1}P(s_{t+1}|s_t,a_t)\pi_\theta(a_t|s_t)$$
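
As a worked instance of the product above, here is a minimal sketch for a tabular problem with two states and two actions. The numbers in `rho0`, `P`, and `pi` are invented for illustration.

```python
import numpy as np

rho0 = np.array([0.8, 0.2])                 # rho_0(s): start-state distribution
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[s, a, s'] = P(s' | s, a)
              [[0.5, 0.5], [0.0, 1.0]]])
pi = np.array([[0.7, 0.3],                  # pi[s, a] = pi(a | s)
               [0.4, 0.6]])

def trajectory_probability(states, actions):
    """P(tau | pi) for states (s_0, ..., s_T) and actions (a_0, ..., a_{T-1})."""
    prob = rho0[states[0]]
    for t, a in enumerate(actions):
        prob *= P[states[t], a, states[t + 1]] * pi[states[t], a]
    return prob

# tau = (s_0=0, a_0=1, s_1=1, a_1=0, s_2=0), so T = 2
p_tau = trajectory_probability(states=[0, 1, 0], actions=[1, 0])
# 0.8 * (0.8 * 0.3) * (0.5 * 0.4) = 0.0384
```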

### 3. Reward-Return and the Objective of RL:
+ The *reward* is received after the agent performs an action $a_t$ in the state $s_t$: $r_t=R(s_t,a_t)$.
+ The *return of a trajectory* $\tau$ is the sum of the collected rewards $r_t$, weighted by a discount factor $\gamma$: $R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_t$.
+ *The goal of RL* is to select an optimal policy $\pi^*$ which **maximizes the expected return $J(\pi)$** when the agent acts according to it:
    $$\pi^*= \arg\max_\pi J(\pi) \quad \text{where} \quad J(\pi)= \mathbb{E}_{\tau \sim \pi} [R(\tau)] = \int_\tau P(\tau|\pi) R(\tau) $$

Seeing $\mathbb{E}_{\tau \sim \pi}$ and $\int_\tau$ in the same equation can be intimidating. However, they are the same quantity written in different forms for different use cases:
   + Use $\mathbb{E}_{\tau \sim \pi}$ to emphasize the intuitive meaning: the expected value when $\tau$ is sampled by acting according to $\pi$.
   + Use $\int_\tau P(\tau|\pi) R(\tau)$ when we need to manipulate the expression mathematically, for example to derive a gradient.
   + Use $\frac{1}{N}\sum_{i=1}^N R(\tau_i)$ when we actually implement it: sample $N$ trajectories $\tau_i$ and average their returns.
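
The return and its sample-average estimate can be sketched as follows. The reward sequences are made-up data standing in for trajectories collected by running a policy, and the function names are chosen for this example only.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """R(tau) = sum_t gamma^t * r_t over the rewards of one trajectory."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

def estimate_objective(reward_sequences, gamma):
    """Monte Carlo estimate of J(pi): the average return over sampled trajectories."""
    return float(np.mean([discounted_return(r, gamma) for r in reward_sequences]))

# Hypothetical rewards from three sampled trajectories.
sampled = [[1.0, 0.0, 2.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
J_hat = estimate_objective(sampled, gamma=0.5)
# returns are 1.5, 0.25 and 1.75, so J_hat = 3.5 / 3
```

The estimate becomes more accurate as more trajectories are averaged, which is why the expectation form and the sample-average form describe the same quantity.
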
### 4. Value function: 
If we start from a state $s$ or a state-action pair $(s,a)$, we define **the value** of $s$ or $(s,a)$ as the expected return when we strictly follow the policy $\pi$ forever after:
+ *Value function*: $V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi} [R(\tau) | s_0=s]$ 
+ *Action-Value function*: $Q^{\pi}(s,a) = \mathbb{E}_{\tau \sim \pi} [R(\tau) | s_0=s, a_0=a]$ 
+ Connection between $V^{\pi}(s)$ and $Q^{\pi}(s,a)$:
    $$ V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi} [R(\tau) | s_0=s] = \int_a \pi(a|s) \, \mathbb{E}_{\tau \sim \pi} [R(\tau) | s_0=s, a_0=a] \, da = \mathbb{E}_{a \sim \pi(\cdot|s)}[Q^{\pi}(s,a)]$$
+ *Advantage function*: describes how much better it is to take a specific action $a$ in state $s$ than to select an action at random according to $\pi(\cdot|s)$:
    $$ A^{\pi}(s,a)= Q^\pi(s,a) - V^{\pi}(s) $$
  Intuitively, the Action-Value function $Q^\pi(s,a)$ depends on two factors, and the advantage isolates the second one:
    + The value of the underlying state $s$: in some states, all actions are bad. You lose no matter which action you take.
    + The action you take in that state: some actions may be better than others.
    + The advantage value tells you how much better an action is compared with the expected value of an action sampled from the policy.
+ If we follow the optimal policy from the state $s$, the resulting value is called the *Optimal Value function* $V^*(s)$.
+ If we take an action $a$ in the state $s$ and follow the optimal policy afterwards, the resulting value is called the *Optimal Action-Value function* $Q^*(s,a)$.
+ Their connection is: $ V^*(s)= \max_a  Q^*(s,a)$.
+ The optimal action: $a^*(s) = \arg\max_a Q^*(s,a)$.
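
For a small tabular problem, the relations between $V$, $Q$, and $A$ reduce to array arithmetic. The table `Q` and the policy `pi` below are invented numbers, used only to show the identities. The last two lines assume, for the sake of the example, that the same table holds $Q^*$.

```python
import numpy as np

# Hypothetical Q^pi table: 2 states x 3 actions.
Q = np.array([[1.0, 2.0, 3.0],
              [-1.0, -2.0, -3.0]])   # in state 1, every action is bad
pi = np.array([[0.2, 0.3, 0.5],      # pi[s, a] = pi(a | s)
               [0.5, 0.25, 0.25]])

V = (pi * Q).sum(axis=1)             # V^pi(s) = E_{a ~ pi}[Q^pi(s, a)]
A = Q - V[:, None]                   # A^pi(s, a) = Q^pi(s, a) - V^pi(s)

# Assumption for illustration: if the table held Q* instead of Q^pi, then
V_star = Q.max(axis=1)               # V*(s) = max_a Q*(s, a)
a_star = Q.argmax(axis=1)            # a*(s) = argmax_a Q*(s, a)
```

Note that the advantage averages to zero under the policy in every state, and that in state 1 the best action still has a negative value: the advantage says which action is least bad, not that the state is good.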

## II. Kinds of RL Algorithms:
RL algorithms can be divided into:
+ Model-free methods
+ Model-based methods
  
Model-free methods can then be divided into two main approaches:
  + **On-Policy (or Policy Optimization)**: We learn the optimal policy $\pi_\theta(\cdot|s)$ directly, and always act according to the current policy. The procedure generally includes:
    + Initialize a random policy $\pi_\theta(\cdot|s)$.
    + For every n steps in each episode, do:
      + Follow the policy $\pi_\theta(a|s)$ to select an action at each step $t$, and collect the rewards $r_t$.
      + After n steps, compute the return $R(\tau)$ and estimate the objective $J(\pi_\theta)$. Update the policy by stochastic gradient ascent with learning rate $\alpha$:
        $$\theta \leftarrow \theta + \alpha \nabla_\theta J(\pi_\theta)$$
  + **Off-Policy (or Q-learning)**: We learn the Action-Value function $Q_\phi(s,a)$ and select the action that maximizes it: $a^*(s) = \arg\max_a Q_{\phi}(s,a)$. There is no explicit policy here, and the procedure generally includes:
    + Initialize a random Q-value function $Q_{\phi}(s,a)$.
    + For every n steps in each episode, do:
      + Select the action that maximizes $Q_\phi(s,a)$, but with probability $\epsilon$ select a random action instead. Add the data $(s_t,a_t,r_t,s_{t+1},done)$ to the data buffer.
      + Sample data from the buffer, and update the $Q_\phi(s,a)$ network by SGD with learning rate $\alpha$:
        $$ \phi \leftarrow \phi - \alpha \nabla_\phi (Q_\phi(s,a)-Q_T(s,a))^2 $$
        where $Q_T(s,a)$ is the target Q-value, which we do not know and must approximate with a bootstrapping technique.
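
The on-policy procedure above can be sketched on the smallest possible problem: a two-armed bandit, where every trajectory is a single step. This is a minimal illustration under stated assumptions, not a full algorithm. The softmax policy over two logits, the reward table `arm_rewards`, and the single-sample (REINFORCE-style) gradient estimate $R(\tau)\nabla_\theta \log \pi_\theta(a)$ are choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                  # policy parameters: one logit per action
arm_rewards = np.array([0.0, 1.0])   # hypothetical one-step environment
alpha = 0.1                          # learning rate

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(200):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)          # act according to the current policy
    ret = arm_rewards[a]                # R(tau) of this one-step trajectory
    grad_log_pi = -probs                # gradient of log softmax w.r.t. the logits:
    grad_log_pi[a] += 1.0               #   onehot(a) - probs
    theta += alpha * ret * grad_log_pi  # gradient ascent on the sampled objective

final_probs = softmax(theta)
```

Each update uses only the action just sampled from the current policy, which is what makes the scheme on-policy. After training, `final_probs` puts more mass on the rewarding arm.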

Therefore, the differences between On-Policy and Off-Policy are:

|  	| On-Policy 	| Off-Policy 	|
|---	|---	|---	|
| Principle 	| - Strictly follow the current policy for n steps.  	| - There is no explicit policy. Instead, we approximate the Q-value and infer the optimal action from it. |
| Update      | - Update the policy using only the newest data, collected in the last n steps. |  - Update the Q-function using data sampled at random from a buffer that spans all episodes, regardless of time order. 	|
| Advantage 	| - Stable and more intuitive, because the policy is optimized directly. 	| - Data-efficient, and the network is updated faster when it works. 	|
| Disadvantage 	| - Slow convergence and data-inefficient. 	| - Can be very unstable. There is no guarantee that optimizing the Q-value leads to an optimal policy, and $Q_T$ is generally a weak approximation.	|
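
The off-policy column can be made concrete with a tabular sketch, in which a table `Q` stands in for the network $Q_\phi$. Everything about the environment is an assumption for illustration: a four-state chain where action 0 moves right, action 1 moves left, and reaching the last state gives reward 1. The target uses the standard one-step bootstrap $r + \gamma \max_{a'} Q(s', a')$, which is one common way to approximate $Q_T$.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 4, 2          # hypothetical chain; action 0 = right, 1 = left
GAMMA, LR, EPSILON = 0.9, 0.5, 0.3

def step(s, a):
    """Toy environment: reward 1 on reaching the last state, which ends the episode."""
    s_next = min(s + 1, N_STATES - 1) if a == 0 else max(s - 1, 0)
    done = s_next == N_STATES - 1
    return s_next, float(done), done

Q = np.zeros((N_STATES, N_ACTIONS))
buffer = []                         # stores (s, a, r, s_next, done) from all episodes

for episode in range(200):
    s = 0
    for _ in range(50):             # cap on episode length
        # Epsilon-greedy: mostly the greedy action, sometimes a random one.
        if rng.random() < EPSILON:
            a = int(rng.integers(N_ACTIONS))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        buffer.append((s, a, r, s_next, done))

        # Update from a randomly sampled past transition, regardless of time order.
        bs, ba, br, bs_next, bdone = buffer[rng.integers(len(buffer))]
        target = br + GAMMA * (0.0 if bdone else Q[bs_next].max())  # bootstrapped Q_T
        Q[bs, ba] -= LR * (Q[bs, ba] - target)  # step down the squared-error gradient

        s = s_next
        if done:
            break

greedy_actions = Q[:-1].argmax(axis=1)   # inferred action in each non-terminal state
```

The transition used for the update is usually not the one just collected, which is the sense in which the update ignores time order. In this toy chain the greedy action inferred from `Q` is "right" in every non-terminal state.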

  + **Actor-Critic** approaches combine the advantages of both On-Policy and Off-Policy methods:
    + An Actor Network with parameters $\theta$ approximates the policy $\pi_\theta(a|s)$.
    + A Critic Network with parameters $\phi$ estimates the Q-value function $Q_{\phi}(s,a)$ or the V-value function $V_{\phi}(s)$.
    + The result is generally better than either approach alone. In fact, most advanced algorithms employ the Actor-Critic structure and differ only in how they choose actions: using the policy (then it is On-Policy) or the Q-function (then it is Off-Policy). In many cases, the boundary is not clear.
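
A minimal sketch of the actor-critic idea, again on a hypothetical two-armed bandit: the actor is a softmax over two logits and the critic is a single scalar `v` estimating the value of the only state. Using the critic's estimate as a baseline, so that the actor is updated with `r - v` instead of the raw return, is one simple variant chosen for this example. Practical algorithms use neural networks for both parts.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                  # actor parameters (action logits)
v = 0.0                              # critic: value estimate of the single state
arm_rewards = np.array([0.0, 1.0])   # hypothetical one-step environment
ACTOR_LR, CRITIC_LR = 0.1, 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(300):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)       # the actor chooses the action
    r = arm_rewards[a]
    advantage = r - v                # the critic judges the outcome against its estimate
    v += CRITIC_LR * (r - v)         # critic update: move v toward the observed reward
    grad_log_pi = -probs             # gradient of log softmax: onehot(a) - probs
    grad_log_pi[a] += 1.0
    theta += ACTOR_LR * advantage * grad_log_pi   # actor update, weighted by the advantage

actor_probs = softmax(theta)
```

Unlike a pure policy-gradient update weighted by the raw return, the actor here also learns from the unrewarded arm: a reward below the critic's estimate gives a negative advantage, which pushes probability away from that action.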
  
The following diagram illustrates the taxonomy of common RL algorithms:

<img src="https://spinningup.openai.com/en/latest/_images/rl_algorithms_9_15.svg"
     alt="RL taxonomy"
     width="800"
     style="float: left; margin-right: 20px;" />


## Next Steps:
+ [Deep Q-Learning](Q-Learning.ipynb): to learn the basics of Off-Policy methods.
+ [Vanilla Policy Gradient](Vanila_Policy_Optimization.ipynb): to learn the basics of On-Policy methods.
+ [Config Usage and Write new Algorithm](Config_Usage.md): to learn how to use the config file and write a new algorithm.