# Control with Approximation

> A summary of the lecture "Prediction and Control with Function Approximation" from Coursera.

- toc: true 
- badges: true
- comments: true
- author: Chanseok Kang
- categories: [Python, Coursera, Reinforcement_Learning]
- image: 

## Episodic SARSA with Function Approximation

### State-values to action-values

$$ v_{\pi}(s) \approx \hat{v}(s, w) \doteq w^Tx(s) \\
q_{\pi}(s, a) \approx \hat{q}(s, a, w) \doteq w^Tx(s, a) $$

### Representing actions

$x(s) = \begin{bmatrix} x_0(s) \\ x_1(s) \\ x_2(s) \\ x_3(s) \end{bmatrix} \\ \mathcal{A}(s) = \{a_0, a_1, a_2\}$

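To represent actions, the state features are copied into one block per action, giving a vector of length $4 \times 3 = 12$. Only the block for the chosen action is active; the other blocks are zero. For example, for $a_0$:

$x(s, a_0) = \begin{bmatrix} x_0(s) \\ x_1(s) \\ x_2(s) \\ x_3(s) \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}$

For $a_1$ the state features occupy the middle block, and for $a_2$ the last block. This is called **stacked features**.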
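As a minimal sketch of stacked features, assuming the usual reading in which $x(s)$ fills the block of the chosen action and every other block is zero, the construction and the linear estimate $w^Tx(s, a)$ could look like this. The function names and the example numbers are illustrative, not taken from the lecture.

```python
def stacked_features(state_features, action, num_actions):
    """Build x(s, a): copy x(s) into the block for `action`, zeros elsewhere."""
    n = len(state_features)
    x_sa = [0.0] * (n * num_actions)
    start = action * n
    x_sa[start:start + n] = state_features
    return x_sa


def q_hat(w, state_features, action, num_actions):
    """Linear action-value estimate w^T x(s, a)."""
    x_sa = stacked_features(state_features, action, num_actions)
    return sum(wi * xi for wi, xi in zip(w, x_sa))


# Illustrative values: 4 state features, 3 actions, so w has 12 entries.
x_s = [1.0, 0.5, 0.0, 2.0]
w = [0.1] * 4 + [0.2] * 4 + [0.3] * 4

print(stacked_features(x_s, 1, 3))
print([q_hat(w, x_s, a, 3) for a in range(3)])
```

Because each action owns its own block of weights, the same state features can yield a different value for every action.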

### Episodic Semi-gradient SARSA for Estimating $\hat{q} \approx q_{*}$
$\begin{aligned}
&\text{Input: a differentiable action-value function parameterization } \hat{q}: \mathcal{S} \times \mathcal{A} \times \mathbb{R}^d \to \mathbb{R} \\
&\text{Algorithm parameters: step size } \alpha > 0, \text{ small } \epsilon > 0 \\
&\text{Initialize value-function weights } w \in \mathbb{R}^d \text{ arbitrarily (e.g., } w = 0 \text{)} \\
\newline
&\text{Loop for each episode:} \\
&\quad S, A \leftarrow \text{ initial state and action of episode (e.g., } \epsilon \text{-greedy)} \\
&\quad \text{Loop for each step of episode:} \\
&\qquad \text{Take action } A, \text{ observe } R, S' \\
&\qquad \text{If } S' \text{ is terminal:} \\
&\qquad \quad w \leftarrow w + \alpha[R - \hat{q}(S, A, w)] \nabla \hat{q}(S, A, w) \\
&\qquad \quad \text{Go to next episode} \\
&\qquad \text{Choose } A' \text{ as a function of } \hat{q}(S', \cdot, w) \text{ (e.g., } \epsilon \text{-greedy)} \\
&\qquad w \leftarrow w + \alpha[R + \gamma \hat{q}(S', A', w) - \hat{q}(S, A, w)] \nabla \hat{q}(S, A, w) \\
&\qquad S \leftarrow S' \\
&\qquad A \leftarrow A' \\
\end{aligned}$
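The pseudocode above can be sketched in plain Python for a linear $\hat{q}$, where $\nabla \hat{q}(S, A, w) = x(S, A)$. The corridor environment (`step`), the one-hot stacked features, and the hyperparameter values below are all assumptions made for illustration; they are not part of the lecture.

```python
import random

NUM_STATES = 4   # non-terminal positions 0..3; reaching position 4 ends the episode
NUM_ACTIONS = 2  # 0 = left, 1 = right


def x(state, action):
    """Stacked one-hot features: one block of NUM_STATES entries per action."""
    features = [0.0] * (NUM_STATES * NUM_ACTIONS)
    features[action * NUM_STATES + state] = 1.0
    return features


def q_hat(state, action, w):
    return sum(wi * xi for wi, xi in zip(w, x(state, action)))


def epsilon_greedy(state, w, epsilon, rng):
    if rng.random() < epsilon:
        return rng.randrange(NUM_ACTIONS)
    values = [q_hat(state, a, w) for a in range(NUM_ACTIONS)]
    return values.index(max(values))  # ties go to the first maximal action


def step(state, action):
    """Hypothetical corridor: reward -1 per step, terminal at position 4."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    return -1.0, next_state, next_state == NUM_STATES


def semi_gradient_sarsa(episodes=200, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    w = [0.0] * (NUM_STATES * NUM_ACTIONS)
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(s, w, epsilon, rng)
        while True:
            r, s_next, terminal = step(s, a)
            grad = x(s, a)  # gradient of a linear q_hat with respect to w
            if terminal:
                delta = r - q_hat(s, a, w)
                w = [wi + alpha * delta * gi for wi, gi in zip(w, grad)]
                break
            a_next = epsilon_greedy(s_next, w, epsilon, rng)
            delta = r + gamma * q_hat(s_next, a_next, w) - q_hat(s, a, w)
            w = [wi + alpha * delta * gi for wi, gi in zip(w, grad)]
            s, a = s_next, a_next
    return w


w = semi_gradient_sarsa()
print([round(q_hat(s, 1, w), 2) for s in range(NUM_STATES)])
```

In this toy setup, moving right from position 3 always ends the episode with reward $-1$, so its estimate is driven only by the terminal update and approaches $-1$.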

## Expected SARSA with Function Approximation

### From SARSA to Expected SARSA

SARSA:

$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big(R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\big) $

Expected SARSA:

$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big(R_{t+1} + \gamma \sum\limits_{a'} \pi(a' \vert S_{t+1} ) Q(S_{t+1}, a') - Q(S_t, A_t) \big) $

### Expected SARSA with Function Approximation

SARSA:

$ w \leftarrow w + \alpha \big(R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, w) - \hat{q}(S_t, A_t, w)\big) \nabla \hat{q}(S_t, A_t, w) $

Expected SARSA:

$ w \leftarrow w + \alpha \big(R_{t+1} + \gamma \sum\limits_{a'} \pi(a' \vert S_{t+1}) \hat{q}(S_{t+1}, a', w) - \hat{q}(S_t, A_t, w) \big) \nabla \hat{q}(S_t, A_t, w) $
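A single Expected SARSA step for a linear $\hat{q}$ might be sketched as follows. This assumes $\pi$ is $\epsilon$-greedy with respect to the current estimates and that the caller supplies $x(S_{t+1}, a')$ for every action; the argument layout and the numbers are hypothetical choices for this example.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def epsilon_greedy_probs(q_values, epsilon):
    """pi(a | s) for an epsilon-greedy policy; ties go to the first maximal action."""
    n = len(q_values)
    probs = [epsilon / n] * n
    probs[q_values.index(max(q_values))] += 1.0 - epsilon
    return probs


def expected_sarsa_update(w, x_sa, reward, next_features, alpha, gamma, epsilon):
    """One Expected SARSA step for q_hat(s, a, w) = w^T x(s, a).

    next_features[a'] is x(S', a') for each action a' (assumed layout).
    """
    next_q = [dot(w, x_next) for x_next in next_features]
    probs = epsilon_greedy_probs(next_q, epsilon)
    expected_q = sum(p * q for p, q in zip(probs, next_q))
    delta = reward + gamma * expected_q - dot(w, x_sa)
    # For a linear q_hat, the gradient with respect to w is x(S, A).
    return [wi + alpha * delta * xi for wi, xi in zip(w, x_sa)]


w_new = expected_sarsa_update(
    w=[1.0, 2.0],
    x_sa=[1.0, 0.0],
    reward=0.0,
    next_features=[[1.0, 0.0], [0.0, 1.0]],
    alpha=0.5,
    gamma=1.0,
    epsilon=0.2,
)
print(w_new)
```

Here the next-state values are 1 and 2 with probabilities 0.1 and 0.9, so the expected value is 1.9, the TD error is 0.9, and only the first weight moves.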

### Expected SARSA to Q-learning

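Q-learning is the special case of Expected SARSA in which the target policy $\pi$ is greedy with respect to $\hat{q}$, so the expectation over $a'$ reduces to a maximum:

$ w \leftarrow w + \alpha \big( R_{t+1} + \gamma \max\limits_{a'} \hat{q}(S_{t+1}, a', w) - \hat{q}(S_t, A_t, w)\big) \nabla \hat{q}(S_t, A_t, w) $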
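The link between the two targets can be checked numerically: with a greedy policy, the Expected SARSA target equals the Q-learning target. The helper names and the values in `next_q` are made up for this sketch.

```python
def greedy_probs(q_values):
    """pi(a | s) for a greedy policy; ties go to the first maximal action."""
    probs = [0.0] * len(q_values)
    probs[q_values.index(max(q_values))] = 1.0
    return probs


def expected_sarsa_target(reward, next_q, probs, gamma):
    return reward + gamma * sum(p * q for p, q in zip(probs, next_q))


def q_learning_target(reward, next_q, gamma):
    return reward + gamma * max(next_q)


# Illustrative action-value estimates q_hat(S', a', w) for three actions.
next_q = [0.4, 1.5, -0.2]
t_expected = expected_sarsa_target(1.0, next_q, greedy_probs(next_q), gamma=0.9)
t_q = q_learning_target(1.0, next_q, gamma=0.9)
print(t_expected, t_q)
```

All probability mass sits on the maximizing action, so the weighted sum collapses to the maximum.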

## Exploration under Function Approximation

### Epsilon-Greedy

![eg](image/epsilon_greedy.png)
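As a rough sketch of $\epsilon$-greedy selection over approximate action values: with probability $\epsilon$ a uniformly random action is taken, otherwise a greedy one. Breaking ties at random is a design choice made here, not something stated in the lecture, and the function name is hypothetical.

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Pick an action index from the estimates q_hat(s, a, w) for each a."""
    if rng.random() < epsilon:
        # Explore: uniform over all actions.
        return rng.randrange(len(q_values))
    # Exploit: choose among the maximal actions, breaking ties at random.
    best = max(q_values)
    best_actions = [a for a, q in enumerate(q_values) if q == best]
    return rng.choice(best_actions)


rng = random.Random(0)
print([epsilon_greedy_action([0.1, 0.9, 0.3], 0.1, rng) for _ in range(10)])
```

Random tie-breaking matters when the weights are initialized to zero, since otherwise the first action would always win early on.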