# On-policy Prediction with Approximation

> This is the summary of lecture "Prediction and Control with Function Approximation" from Coursera.

- toc: true 
- badges: true
- comments: true
- author: Chanseok Kang
- categories: [Python, Coursera, Reinforcement_Learning]
- image: 

## Moving to Parameterized Functions

### Parameterizing the Value Function

$$ \hat{v}(s, w) \approx v_{\pi}(s) $$

For example, suppose each state is described by two coordinates, $X$ and $Y$. Then we can express the value function as a weighted sum of them:

$$ \hat{v}(s, w) \doteq w_1 X + w_2 Y $$

### Linear Value Function Approximation

$$ \begin{aligned} \hat{v}(s, w) &\doteq \sum_{i} w_i x_i(s) \\ 
&= \langle w, x(s) \rangle \end{aligned} $$
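
As a minimal sketch of the inner product above, the linear value estimate can be computed in a few lines of plain Python. The weights and the feature values here are made-up numbers for illustration only; they do not come from the lecture.

```python
def v_hat(w, x):
    """Linear value estimate: inner product of weights and features."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))

# Hypothetical 2D state (X, Y) = (3.0, 4.0), used directly as the features.
w = [0.5, -0.25]   # assumed weights, chosen only for the example
x_s = [3.0, 4.0]   # feature vector x(s)

value = v_hat(w, x_s)  # 0.5 * 3.0 + (-0.25) * 4.0 = 0.5
print(value)
```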

### Nonlinear Function Approximation

![nfa](image/nfa.png)

## Generalization and Discrimination

### Categorizing methods based on generalization and discrimination

![gene-desc](image/gene_desc.png)

## Framing Value Estimation as Supervised Learning

- The function approximator should be compatible with online updates (online means that data arrives incrementally as the agent interacts with the environment, rather than being available as a fixed dataset from the start).

- The function approximator should be compatible with bootstrapping (in RL, the target often depends on the weights $w$ themselves, so it changes as learning progresses).

## The Value Error Objective

### The Mean Squared Value Error Objective

$$\overline{VE} =  \sum_{s} \mu(s)[v_{\pi}(s) - \hat{v}(s, w)]^2 $$

Here, $\mu(s)$ is the fraction of time we spend in state $s$ when following policy $\pi$, so errors in frequently visited states count for more.
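
To make the weighting by $\mu(s)$ concrete, here is a small sketch that evaluates the objective for three states. The distribution, true values, and estimates are invented numbers, and `mean_squared_value_error` is a hypothetical helper name.

```python
def mean_squared_value_error(mu, v_pi, v_est):
    """Sum over states of mu(s) * (v_pi(s) - v_hat(s, w))^2."""
    return sum(m * (v - ve) ** 2 for m, v, ve in zip(mu, v_pi, v_est))

# Assumed numbers for a toy problem with three states.
mu = [0.5, 0.3, 0.2]      # state visitation fractions, sum to 1
v_pi = [1.0, 2.0, 3.0]    # true values
v_est = [1.0, 1.0, 4.0]   # approximate values

# 0.5 * 0 + 0.3 * 1 + 0.2 * 1 = 0.5
ve = mean_squared_value_error(mu, v_pi, v_est)
print(ve)
```

Both wrong estimates are off by 1, but the error in the second state contributes more because that state is visited more often.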

## Introducing Gradient Descent

### Gradient - Derivatives in Multiple Dimensions

$$ w \doteq \begin{bmatrix} w_1 \\ w_2 \\ \dots \\ w_d \end{bmatrix} \nabla f \doteq \begin{bmatrix} \frac{\partial f}{\partial w_1} \\ \frac{\partial f}{\partial w_2} \\ \dots \\ \frac{\partial f}{\partial w_d} \end{bmatrix} $$

The sign of each component indicates the direction in which to change the corresponding $w_i$ in order to increase $f$, and its magnitude indicates how quickly $f$ changes along that dimension.

The gradient gives the direction of steepest ascent.
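
The following sketch illustrates both statements on an arbitrary example function $f(w) = w_1^2 + 3 w_2$, which is an assumption made for illustration. It estimates the gradient with central differences and checks that a small step along the gradient increases $f$.

```python
def f(w):
    # Example function chosen for illustration; its exact gradient is [2*w1, 3].
    return w[0] ** 2 + 3.0 * w[1]

def numerical_gradient(func, w, eps=1e-6):
    """Central-difference estimate of each partial derivative."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        grad.append((func(w_plus) - func(w_minus)) / (2 * eps))
    return grad

w = [1.0, -2.0]
grad = numerical_gradient(f, w)  # approximately [2.0, 3.0]

# A small step in the gradient direction (steepest ascent) increases f.
step = 0.01
w_up = [w_i + step * g_i for w_i, g_i in zip(w, grad)]
print(grad, f(w), f(w_up))
```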

## Gradient Monte Carlo for Policy Evaluation

### Gradient of the Mean Squared Value Error Objective

$\begin{aligned}  &\nabla \sum\limits_{s \in \mathcal{S}} \mu(s) [ v_{\pi}(s) - \hat{v}(s, w)]^2 \\ 
&= \sum\limits_{s \in \mathcal{S}} \mu(s) \nabla [v_{\pi}(s) - \hat{v}(s, w)]^2 \\
&= - \sum\limits_{s \in \mathcal{S}} \mu(s) 2 [v_{\pi}(s) - \hat{v}(s, w)] \nabla \hat{v}(s, w) \end{aligned}$

From the previous definition of the linear value function,

$ \hat{v}(s, w) \doteq \langle w, x(s) \rangle $

So the gradient of the value function with respect to $w$ is

$ \nabla \hat{v}(s, w) = x(s) $

As a result,

$ \Delta w \propto \sum\limits_{s \in \mathcal{S}} \mu(s) [v_{\pi}(s) - \hat{v}(s, w)] \nabla \hat{v}(s, w) $

### Gradient Monte Carlo

$$ w_{t+1} \doteq w_{t} + \alpha [ G_t - \hat{v}(S_t, w_t) ] \nabla \hat{v}(S_t, w_t) $$

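Replacing the unknown $v_{\pi}(S_t)$ with the sampled return $G_t$ is justified because $G_t$ is an unbiased estimate of the true value, so the two expected gradients below are equal. Recall that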

$$ v_{\pi}(s) \doteq \mathbb{E}_{\pi}[G_t \vert S_t = s] $$

$$ \begin{aligned} &\mathbb{E}_{\pi}\big[2 [v_{\pi}(S_t) - \hat{v}(S_t, w)] \nabla \hat{v}(S_t, w) \big] \\
&= \mathbb{E}_{\pi} \big[ 2[ G_t - \hat{v}(S_t, w)] \nabla \hat{v}(S_t, w)\big] \end{aligned} $$

### Gradient Monte Carlo Algorithm for Estimating $\hat{v} \approx v_{\pi}$

$\begin{aligned}
&\text{Input: the policy } \pi \text{ to be evaluated } \\
&\text{Input: a differentiable function: } \hat{v} : \mathcal{S} \times \mathbb{R}^d \rightarrow \mathbb{R} \\
&\text{Algorithm parameter: step size } \alpha > 0 \\
&\text{Initialize value-function weights } w \in \mathbb{R}^d \text{ arbitrarily (e.g., } w = 0\text{)} \\
\newline
&\text{Loop forever (for each episode):} \\
&\quad \text{Generate an episode } S_0, A_0, R_1, S_1, A_1, \dots, R_T, S_T \text{ using } \pi \\
&\quad \text{Loop for each step of episode, } t=0, 1, \dots, T-1: \\
&\qquad w \leftarrow w + \alpha[G_t - \hat{v}(S_t, w)] \nabla \hat{v}(S_t, w)
\end{aligned}$
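
The algorithm above can be sketched in plain Python as follows. The environment is an assumption made for illustration: a 5-state random walk that starts in the middle, moves left or right with equal probability, and gives reward 1 only when it exits on the right. The features are one-hot, so this is the tabular special case of linear approximation, and the true values are $1/6, 2/6, \dots, 5/6$. All names (`features`, `generate_episode`, `gradient_mc`) are hypothetical.

```python
import random

NUM_STATES = 5  # non-terminal states 0..4 of the assumed random walk

def features(s):
    """One-hot feature vector x(s)."""
    x = [0.0] * NUM_STATES
    x[s] = 1.0
    return x

def v_hat(s, w):
    return sum(w_i * x_i for w_i, x_i in zip(w, features(s)))

def generate_episode(rng):
    """Return [(S_t, R_{t+1}), ...] under the equiprobable random policy."""
    s = NUM_STATES // 2
    episode = []
    while True:
        s_next = s + rng.choice((-1, 1))
        reward = 1.0 if s_next == NUM_STATES else 0.0
        episode.append((s, reward))
        if s_next < 0 or s_next >= NUM_STATES:
            return episode
        s = s_next

def gradient_mc(num_episodes=5000, alpha=0.005, gamma=1.0, seed=0):
    rng = random.Random(seed)
    w = [0.0] * NUM_STATES
    for _ in range(num_episodes):
        episode = generate_episode(rng)
        # Compute the return G_t for every step by sweeping backward.
        g = 0.0
        returns = []
        for _, reward in reversed(episode):
            g = reward + gamma * g
            returns.append(g)
        returns.reverse()
        # w <- w + alpha * [G_t - v_hat(S_t, w)] * grad v_hat(S_t, w)
        for (s, _), g_t in zip(episode, returns):
            error = g_t - v_hat(s, w)
            grad = features(s)  # gradient of a linear v_hat is x(s)
            w = [w_i + alpha * error * g_i for w_i, g_i in zip(w, grad)]
    return w

w_mc = gradient_mc()
print([round(w_i, 2) for w_i in w_mc])
```

With a constant step size the weights keep fluctuating around the true values rather than settling exactly, so the printed estimates should only be expected to land near $i/6$.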

## Semi-Gradient TD for Policy Evaluation

### The TD Update for Function Approximation

$w \leftarrow w + \alpha [ U_t - \hat{v}(S_t, w)] \nabla \hat{v}(S_t, w) $

If $U_t$ is an unbiased estimate of $v_{\pi}(S_t)$, $w$ will converge to a local optimum. We can also replace $U_t$ with a bootstrapped target, the one-step TD target:

$U_t \doteq R_{t+1} + \gamma \hat{v}(S_{t+1}, w)$

The bootstrapped target is biased, because it uses our current value estimate, which will generally not equal the true value function. In this case $w$ is not guaranteed to converge to a local optimum of $\overline{VE}$. However, the TD target usually has lower variance than a sampled return.

### Semi-gradient method of TD

$ \nabla \frac{1}{2} [U_t - \hat{v}(S_t, w)]^2 \\
= (U_t - \hat{v}(S_t, w))(\nabla U_t - \nabla \hat{v}(S_t, w)) \neq -(U_t - \hat{v}(S_t, w)) \nabla \hat{v}(S_t, w)$

The right-hand term is the negative of the TD update direction. The two sides would be equal only if $\nabla U_t = 0$.

But for TD, $\nabla U_t \neq 0$:

$\begin{aligned} \nabla U_t &= \nabla(R_{t+1} + \gamma \hat{v}(S_{t+1}, w)) \\
&= \gamma \nabla \hat{v}(S_{t+1}, w) \\
&\neq 0 \end{aligned}$

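So the TD update is not a true gradient descent step: it accounts for the effect of $w$ on the estimate $\hat{v}(S_t, w)$ but ignores its effect on the target $U_t$. This is why it is called a **semi-gradient** method.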

### Semi-gradient TD(0) for estimating $\hat{v} \approx v_{\pi}$

$\begin{aligned}
&\text{Input: the policy } \pi \text{ to be evaluated } \\
&\text{Input: a differentiable function } \hat{v} : \mathcal{S}^{+} \times \mathbb{R}^d \to \mathbb{R} \text{ such that } \hat{v}(\text{terminal}, \cdot) = 0 \\
&\text{Algorithm parameter: step size } \alpha > 0 \\
&\text{Initialize value-function weights } w \in \mathbb{R}^d \text{ arbitrarily (e.g., } w = 0) \\
\newline
&\text{Loop for each episode:} \\
&\quad \text{Initialize } S \\
&\quad \text{Loop for each step of episode: } \\
&\qquad \text{Choose } A \sim \pi(\cdot \vert S) \\
&\qquad \text{Take action } A, \text{ observe } R, S' \\
&\qquad w \leftarrow w + \alpha [ R + \gamma \hat{v}(S', w) - \hat{v}(S, w)] \nabla \hat{v}(S, w) \\
&\qquad S \leftarrow S'\\
&\quad \text{until } S \text{ is terminal}
\end{aligned}$
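
A minimal Python sketch of the algorithm above follows. As an assumption for illustration, it uses a 5-state random walk (start in the middle, equiprobable left/right moves, reward 1 only on exiting to the right) with one-hot features, for which the true values are $1/6, \dots, 5/6$. The function names are hypothetical.

```python
import random

NUM_STATES = 5  # non-terminal states 0..4 of the assumed random walk

def features(s):
    """One-hot feature vector x(s)."""
    x = [0.0] * NUM_STATES
    x[s] = 1.0
    return x

def v_hat(s, w):
    return sum(w_i * x_i for w_i, x_i in zip(w, features(s)))

def semi_gradient_td0(num_episodes=5000, alpha=0.01, gamma=1.0, seed=0):
    rng = random.Random(seed)
    w = [0.0] * NUM_STATES
    for _ in range(num_episodes):
        s = NUM_STATES // 2  # Initialize S
        done = False
        while not done:
            # Equiprobable random policy: step left or right.
            s_next = s + rng.choice((-1, 1))
            done = s_next < 0 or s_next >= NUM_STATES
            reward = 1.0 if s_next == NUM_STATES else 0.0
            # v_hat(terminal, .) = 0 by definition.
            v_next = 0.0 if done else v_hat(s_next, w)
            td_error = reward + gamma * v_next - v_hat(s, w)
            # Semi-gradient: only grad v_hat(S, w) = x(S) is used;
            # the dependence of the target on w is ignored.
            x = features(s)
            w = [w_i + alpha * td_error * x_i for w_i, x_i in zip(w, x)]
            s = s_next
    return w

w_td0 = semi_gradient_td0()
print([round(w_i, 2) for w_i in w_td0])
```

Unlike the Monte Carlo version, each update happens immediately after a single transition, so no episode needs to be stored.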

## Building knowledge for AI agents with Reinforcement Learning (Doina Precup)

- Focusing on two types of knowledge
    - Procedural knowledge: policies, but also skills, goal-driven behavior
    - Predictive, empirical knowledge: Value function, but also models
    
- Knowledge of RL agents should be...
    - Expressive: able to represent many things, including abstractions like objects, space, people, and extended actions
    - Learnable: from data without labels or supervision (for scalability)
    - Composable: suitable for supporting planning / reasoning by assembling existing pieces.
    
- Two kinds of abstraction
    - Temporal abstraction: reasoning and generalization over many different time scales
    - State abstraction / function approximation: generalization over many different states
    
- Temporal abstraction and procedural knowledge - Options (Sutton, Precup & Singh, 1999)
    - An option $w$ consists of 3 components
        - An initiation set $I_w \subseteq \mathcal{S}$ (a.k.a precondition)
        - A policy $\pi_w: \mathcal{S} \times \mathcal{A} \to [0, 1], \pi_{w}(a \vert s)$ is the probability of taking $a$ in $s$ when following option $w$
        - A termination condition $\beta_{w} : \mathcal{S} \to [0, 1], \beta_{w}(s)$ is the probability of terminating the option $w$ upon entering $s$
    - E.g., robot navigation: if there is no obstacle in front $(I_w)$, go forward $(\pi_w)$ until you get too close to another object $(\beta_w)$
    - Inspired by macro-actions / behaviors in robotics / hybrid planning and control
    
- Option models
    - Option model has two parts:
        1. Expected Reward $r_w(s)$: the expected return during $w$'s execution from state $s$
        2. Transition model $P_w(s' \vert s)$: specifies where the agent will end up after the option / program execution and when termination will happen
    - Models are predictions about the future, conditioned on the option being executed
    
- MDP + options = Semi-Markov Decision Process
    - Introducing options in an MDP induces a related semi-MDP
    - Hence all planning and learning algorithms from classical MDPs transfer directly to options
    - But planning and learning with options can be much faster!
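
The three components of an option listed above (initiation set, policy, termination condition) can be sketched as a small data structure. This is only an illustration under simplifying assumptions: states are integers, the option policy is deterministic, and the corridor environment, `Option`, and `run_option` are hypothetical names that do not come from the talk.

```python
import random
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    initiation_set: Set[int]                   # I_w: states where the option may start
    policy: Callable[[int], int]               # pi_w: deterministic here for simplicity
    termination_prob: Callable[[int], float]   # beta_w(s)

def run_option(option, state, env_step, rng, max_steps=100):
    """Follow the option until it terminates; return (final state, steps taken)."""
    if state not in option.initiation_set:
        raise ValueError("option cannot be initiated in this state")
    steps = 0
    while steps < max_steps:
        action = option.policy(state)
        state = env_step(state, action)
        steps += 1
        # Terminate with probability beta_w(s) upon entering the new state.
        if rng.random() < option.termination_prob(state):
            break
    return state, steps

# Assumed corridor with positions 0..5; position 5 is next to a wall.
def corridor_step(state, action):
    return min(state + action, 5)

go_forward = Option(
    initiation_set={0, 1, 2, 3, 4},                       # no obstacle in front
    policy=lambda s: 1,                                   # go forward
    termination_prob=lambda s: 1.0 if s == 5 else 0.0,    # stop at the wall
)

final_state, steps = run_option(go_forward, 2, corridor_step, random.Random(0))
print(final_state, steps)  # 5 3
```

From the agent's point of view, the whole option behaves like one extended action that took three time steps, which is the sense in which options induce a semi-MDP.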

## The Linear TD

### TD Update with Linear Function Approximation

We define the TD error $\delta$,

$$ \delta_t \doteq R_{t+1} + \gamma \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w) $$

TD update,

$ w \leftarrow w + \alpha \delta_t \nabla \hat{v}(S_t, w)$

If we use linear function approximation,

$ \hat{v}(S_t, w) \doteq w^T x(S_t)$

Then the gradient with respect to $w$ is simply the feature vector of the state,

$ \nabla \hat{v}(S_t, w) = x(S_t) $

So the TD update simplifies to

$ w \leftarrow w + \alpha \delta_t x(S_t)$

If the features are well designed, this simple update can yield an effective value function approximation.
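
As a worked example of a single linear TD update, the sketch below computes $\delta_t$ and the new weights for one transition. The weights, features, reward, step size, and discount are made-up numbers, and `linear_td_update` is a hypothetical helper.

```python
def dot(a, b):
    return sum(a_i * b_i for a_i, b_i in zip(a, b))

def linear_td_update(w, x_t, reward, x_next, alpha, gamma):
    """One step of w <- w + alpha * delta_t * x(S_t) for linear v_hat."""
    delta = reward + gamma * dot(w, x_next) - dot(w, x_t)
    w_new = [w_i + alpha * delta * x_i for w_i, x_i in zip(w, x_t)]
    return w_new, delta

# Assumed values for illustration.
w = [0.0, 1.0]
x_t = [1.0, 0.0]      # features of S_t
x_next = [0.0, 1.0]   # features of S_{t+1}

w_new, delta = linear_td_update(w, x_t, reward=1.0, x_next=x_next,
                                alpha=0.5, gamma=0.5)
# delta = 1.0 + 0.5 * 1.0 - 0.0 = 1.5
# w_new = [0.0 + 0.5 * 1.5 * 1.0, 1.0 + 0.5 * 1.5 * 0.0] = [0.75, 1.0]
print(delta, w_new)
```

Only the weights whose features are active in $S_t$ change, which is why the choice of features controls how the update generalizes across states.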

## The true objective for TD

### The Expected TD update

$\begin{aligned} w_{t+1} &\doteq w_t + \alpha[R_{t+1} + \gamma \hat{v}(S_{t+1}, w_t) - \hat{v}(S_t, w_t)] x_t \\
 &= w_t + \alpha [ R_{t+1} + \gamma w_t^T x_{t+1} - w_t^T x_t] x_t \\
 &= w_t + \alpha [ R_{t+1}x_t - x_t(x_t - \gamma x_{t+1})^T w_t] 
 \end{aligned}$
 
We can write the expected weight update as

$ \mathbb{E}[\Delta w_t] = \alpha (\text{b} - \text{A}w_t)$

where $\text{b}$ and $\text{A}$ are defined as

$ \text{b } \doteq \mathbb{E}[R_{t+1} x_t] \quad \text{A } \doteq \mathbb{E}[x_t (x_t - \gamma x_{t+1})^T] $

### The TD Fixed Point

Linear TD has converged when the expected TD update is zero:

$$ \mathbb{E}[\Delta w_{TD}] = \alpha(\text{b } - \text{ A}w_{TD}) = 0 $$

If $\text{A}$ is invertible,

$ w_{TD} = \text{A}^{-1} \text{b}$

This solution of the linear system is called the **TD fixed point**. It minimizes $(\text{b } - \text{ A}w)^T(\text{b } - \text{ A}w)$.
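
To see the fixed point concretely, the sketch below builds $\text{A}$ and $\text{b}$ for a tiny assumed problem and solves $\text{A} w = \text{b}$ directly. The problem is a deterministic two-state chain (state 0 moves to state 1 with reward 1, state 1 moves back with reward 0), with $\gamma = 0.5$, $\mu = (0.5, 0.5)$, and one-hot features. None of these numbers come from the lecture.

```python
def outer(u, v):
    return [[u_i * v_j for v_j in v] for u_i in u]

def solve_2x2(A, b):
    """Solve A w = b for a 2x2 system with Cramer's rule."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    return [(b[0] * a22 - a12 * b[1]) / det,
            (a11 * b[1] - a21 * b[0]) / det]

gamma = 0.5
# Each entry: (mu(s), x(s), reward, x(s')) for the assumed two-state chain.
transitions = [
    (0.5, [1.0, 0.0], 1.0, [0.0, 1.0]),
    (0.5, [0.0, 1.0], 0.0, [1.0, 0.0]),
]

A = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]
for mu, x, reward, x_next in transitions:
    # A = E[x_t (x_t - gamma * x_{t+1})^T],  b = E[R_{t+1} x_t]
    diff = [x_i - gamma * xn_i for x_i, xn_i in zip(x, x_next)]
    contribution = outer(x, diff)
    for i in range(2):
        b[i] += mu * reward * x[i]
        for j in range(2):
            A[i][j] += mu * contribution[i][j]

w_td = solve_2x2(A, b)
print(w_td)  # approximately [1.3333, 0.6667]
```

Because the features are one-hot, the fixed point coincides with the true values here ($v(0) = 4/3$, $v(1) = 2/3$). With coarser features the fixed point would generally differ from the minimizer of $\overline{VE}$.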

### Relating the TD Fixed Point and the Minimum of the Value Error

$$ \overline{VE}(w_{TD}) \leq \frac{1}{1 - \gamma} \min\limits_{w} \overline{VE}(w) $$