# On-policy Prediction with Approximation

> This is a summary of the lecture "Prediction and Control with Function Approximation" from Coursera.

- toc: true 
- badges: true
- comments: true
- author: Chanseok Kang
- categories: [Python, Coursera, Reinforcement_Learning]
- image: 

## Moving to Parameterized Functions

### Parameterizing the Value Function

$$ \hat{v}(s, w) \approx v_{\pi}(s) $$

For example, suppose each state is described by two coordinates, $X$ and $Y$. Then we can parameterize the value function with one weight per coordinate:

$$ \hat{v}(s, w) \doteq w_1 X + w_2 Y $$

### Linear Value Function Approximation

$$ \begin{aligned} \hat{v}(s, w) &\doteq \sum_{i} w_i x_i(s) \\ 
&= \langle w, x(s) \rangle \end{aligned} $$
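The inner product above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the lecture: the helper name `linear_value` and the example weights and features are assumptions chosen only to make the arithmetic easy to follow.

```python
def linear_value(w, x):
    """Return v_hat(s, w) = <w, x(s)> for weights w and features x(s)."""
    if len(w) != len(x):
        raise ValueError("w and x(s) must have the same length")
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


# Hypothetical 2D state s = (X, Y), used directly as the feature vector.
w = [0.5, -1.0]
x_s = [2.0, 3.0]

v_hat = linear_value(w, x_s)  # 0.5 * 2.0 + (-1.0) * 3.0
print(v_hat)
```

Changing a single weight changes the estimate for every state whose corresponding feature is non-zero, which is where generalization comes from.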

### Nonlinear Function Approximation

![nfa](image/nfa.png)

## Generalization and Discrimination

### Categorizing methods based on generalization and discrimination

![gene-desc](image/gene_desc.png)

## Framing Value Estimation as Supervised Learning

- The function approximator should be compatible with online updates (online means that data arrive one sample at a time as the agent interacts with the environment, rather than as a fixed dataset that is available from the start.)

- The function approximator should be compatible with bootstrapping (in RL the target often depends on the current weights $w$, so the targets change as learning proceeds.)

## The Value Error Objective

### The Mean Squared Value Error Objective

$$\overline{VE} =  \sum_{s} \mu(s)[v_{\pi}(s) - \hat{v}(s, w)]^2 $$

Here, $\mu(s)$ is the fraction of time we spend in state $s$ when following policy $\pi$. States that are visited more often therefore contribute more to the error.
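As a small worked example of the objective, the sketch below evaluates $\overline{VE}$ for three states. The distribution `mu`, the true values `v_pi`, and the estimates `v_hat` are made-up numbers for illustration only.

```python
def mean_squared_value_error(mu, v_pi, v_hat):
    """Sum over states of mu(s) * (v_pi(s) - v_hat(s, w))^2."""
    return sum(m * (vp - vh) ** 2 for m, vp, vh in zip(mu, v_pi, v_hat))


# Hypothetical three-state problem.
mu = [0.5, 0.25, 0.25]   # state distribution under pi (sums to 1)
v_pi = [1.0, 2.0, 3.0]   # true values
v_hat = [1.0, 1.0, 5.0]  # approximate values

ve = mean_squared_value_error(mu, v_pi, v_hat)
# 0.5 * 0^2 + 0.25 * 1^2 + 0.25 * 2^2 = 1.25
print(ve)
```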

## Introducing Gradient Descent

### Gradient - Derivatives in Multiple Dimensions

$$ w \doteq \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix}, \qquad \nabla f \doteq \begin{bmatrix} \frac{\partial f}{\partial w_1} \\ \frac{\partial f}{\partial w_2} \\ \vdots \\ \frac{\partial f}{\partial w_d} \end{bmatrix} $$

The sign of each component indicates the direction in which to change the corresponding $w_i$ in order to increase $f$, and its magnitude indicates how quickly $f$ changes along that dimension.

The gradient gives the direction of steepest ascent.
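To make this concrete, the sketch below approximates a gradient with central finite differences and checks that a small step along it increases $f$. The function `f`, the helper `numerical_gradient`, and the step size are illustrative assumptions, not part of the lecture.

```python
def numerical_gradient(f, w, eps=1e-6):
    """Approximate each partial derivative of f at w by central differences."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        grad.append((f(w_plus) - f(w_minus)) / (2 * eps))
    return grad


def f(w):
    # Hypothetical objective: f(w) = w_1^2 + 3 * w_2, so grad f = [2 * w_1, 3].
    return w[0] ** 2 + 3.0 * w[1]


w = [1.0, -2.0]
grad = numerical_gradient(f, w)  # approximately [2.0, 3.0]

# Moving a small distance along the gradient increases f (ascent);
# moving against it decreases f (descent).
step = 0.1
w_up = [w_i + step * g_i for w_i, g_i in zip(w, grad)]
w_down = [w_i - step * g_i for w_i, g_i in zip(w, grad)]
print(grad, f(w_down), f(w), f(w_up))
```

Gradient descent uses the `w_down` direction, because the goal is to decrease an error rather than increase it.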

## Gradient Monte Carlo for Policy Evaluation

### Gradient of the Mean Squared Value Error Objective

$\begin{aligned}  &\nabla \sum\limits_{s \in \mathcal{S}} \mu(s) [ v_{\pi}(s) - \hat{v}(s, w)]^2 \\ 
&= \sum\limits_{s \in \mathcal{S}} \mu(s) \nabla [v_{\pi}(s) - \hat{v}(s, w)]^2 \\
&= - \sum\limits_{s \in \mathcal{S}} \mu(s) 2 [v_{\pi}(s) - \hat{v}(s, w)] \nabla \hat{v}(s, w) \end{aligned}$

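From the earlier definition of the linear value function,

$ \hat{v}(s, w) \doteq \langle w, x(s) \rangle $

So, in the linear case, the gradient of the value function with respect to $w$ is just the feature vector:

$ \nabla \hat{v}(s, w) = x(s) $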

As a result,

$ \Delta w \propto \sum\limits_{s \in \mathcal{S}} \mu(s) [v_{\pi}(s) - \hat{v}(s, w)] \nabla \hat{v}(s, w) $
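The update direction above can be sketched for the linear case, where $\nabla \hat{v}(s, w) = x(s)$. This is an idealized illustration that assumes $v_{\pi}$ and $\mu$ are known, which is not true in practice. The three-state problem, the feature vectors, and the step size `alpha` are all made up for the example.

```python
def dot(a, b):
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


def value_error(w, mu, v_pi, features):
    """Mean squared value error for a linear v_hat(s, w) = <w, x(s)>."""
    return sum(m * (vp - dot(w, x)) ** 2 for m, vp, x in zip(mu, v_pi, features))


def expected_update(w, mu, v_pi, features, alpha):
    """One step of w + alpha * sum_s mu(s) [v_pi(s) - v_hat(s, w)] x(s)."""
    step = [0.0] * len(w)
    for m, vp, x in zip(mu, v_pi, features):
        error = vp - dot(w, x)
        for i, x_i in enumerate(x):
            step[i] += m * error * x_i
    return [w_i + alpha * s_i for w_i, s_i in zip(w, step)]


# Hypothetical problem: three states, two features, values representable by w = [1, 2].
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mu = [0.5, 0.25, 0.25]
v_pi = [1.0, 2.0, 3.0]
alpha = 0.5

w = [0.0, 0.0]
ve_before = value_error(w, mu, v_pi, features)
w_one = expected_update(w, mu, v_pi, features, alpha)
ve_after = value_error(w_one, mu, v_pi, features)

# Repeating the step drives the weights toward the minimizer of the value error.
w_final = w
for _ in range(200):
    w_final = expected_update(w_final, mu, v_pi, features, alpha)
print(ve_before, ve_after, w_final)
```

The constant factor 2 from the derivative is absorbed into the step size, which is why the update is written with $\propto$.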

### Gradient Monte Carlo

$$ w_{t+1} \doteq w_{t} + \alpha [ G_t - \hat{v}(S_t, w_t) ] \nabla \hat{v}(S_t, w_t) $$

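We do not know $v_{\pi}$, but recall that it is defined as the expected return:

$$ v_{\pi}(s) \doteq \mathbb{E}_{\pi}[G_t \vert S_t = s] $$

Therefore $G_t$ is an unbiased sample of $v_{\pi}(S_t)$, and replacing one with the other leaves the expected gradient unchanged:

$$ \begin{aligned} &\mathbb{E}_{\pi}\big[2 [v_{\pi}(S_t) - \hat{v}(S_t, w)] \nabla \hat{v}(S_t, w) \big] \\
&= \mathbb{E}_{\pi} \big[ 2[ G_t - \hat{v}(S_t, w)] \nabla \hat{v}(S_t, w)\big] \end{aligned} $$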

### Gradient Monte Carlo Algorithm for Estimating $\hat{v} \approx v_{\pi}$

$\begin{aligned}
&\text{Input: the policy } \pi \text{ to be evaluated } \\
&\text{Input: a differentiable function: } \hat{v} : \mathcal{S} \times \mathbb{R}^d \rightarrow \mathbb{R} \\
&\text{Algorithm parameter: step size } \alpha > 0 \\
&\text{Initialize value-function weights } w \in \mathbb{R}^d \text{ arbitrarily (e.g., } w = 0\text{)} \\
\newline
&\text{Loop forever (for each episode):} \\
&\quad \text{Generate an episode } S_0, A_0, R_1, S_1, A_1, \dots, R_T, S_T \text{ using } \pi \\
&\quad \text{Loop for each step of episode, } t=0, 1, \dots, T-1: \\
&\qquad w \leftarrow w + \alpha[G_t - \hat{v}(S_t, w)] \nabla \hat{v}(S_t, w)
\end{aligned}$
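The algorithm above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the course's implementation: the environment is a hypothetical 5-state random walk (start in the middle, move left or right with equal probability, reward 1 only when exiting on the right), the policy is fixed by that environment, and the features are one-hot vectors so that $\nabla \hat{v}(s, w) = x(s)$. All function names are made up for the example.

```python
import random


def one_hot(state, num_states):
    """Feature vector x(s): 1 at the state's index, 0 elsewhere."""
    x = [0.0] * num_states
    x[state] = 1.0
    return x


def v_hat(w, x):
    """Linear value estimate <w, x(s)>."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


def generate_episode(rng, num_states):
    """Return (states, rewards), where rewards[t] is R_{t+1}."""
    state = num_states // 2
    states, rewards = [], []
    while True:
        states.append(state)
        next_state = state + rng.choice((-1, 1))
        if next_state < 0:              # exit on the left: reward 0
            rewards.append(0.0)
            return states, rewards
        if next_state >= num_states:    # exit on the right: reward 1
            rewards.append(1.0)
            return states, rewards
        rewards.append(0.0)
        state = next_state


def gradient_mc(num_episodes, alpha, gamma=1.0, num_states=5, seed=0):
    rng = random.Random(seed)
    w = [0.0] * num_states
    for _ in range(num_episodes):
        states, rewards = generate_episode(rng, num_states)
        # Compute the return G_t for every step by sweeping backwards.
        returns = [0.0] * len(states)
        g = 0.0
        for t in reversed(range(len(states))):
            g = rewards[t] + gamma * g
            returns[t] = g
        # w <- w + alpha * [G_t - v_hat(S_t, w)] * grad v_hat(S_t, w)
        for t, s in enumerate(states):
            x = one_hot(s, num_states)
            error = returns[t] - v_hat(w, x)
            for i in range(num_states):
                w[i] += alpha * error * x[i]  # gradient of a linear v_hat is x(s)
    return w


w = gradient_mc(num_episodes=5000, alpha=0.005)
print([round(w_i, 2) for w_i in w])
```

For this random walk the true values are $1/6, 2/6, \dots, 5/6$, so the learned weights should end up near those numbers, with some noise because the step size is constant. Swapping `one_hot` for a coarser feature function (for example, state aggregation) would leave the update loop unchanged.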