In [1]:
from envs import ForbiddenAreaCfg, RewardCfg, make_gridworld

env = make_gridworld(
    forbidden_area_cfg=ForbiddenAreaCfg(num=6),
    reward_cfg=RewardCfg(forbidden_area=-10),
)

# TD learning of state value

## State Value 

$v_{\pi}(s_t) = \mathbb{E}[G_t|S_t=s_t]$

or

$v_{\pi}(s_t) = \mathbb{E}[r_{t+1} + \gamma v_{\pi}(s_{t+1})|S_t=s_t]$
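
The recursive form rests on the identity $G_t = r_{t+1} + \gamma G_{t+1}$ for the discounted return. As a small illustration, the sketch below computes the returns of one made-up reward sequence by sweeping backwards; the function name `discounted_returns` and the numbers are illustrative assumptions, not part of the `envs` package.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ...] where rewards[t] plays the role of r_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0  # the return after the final step is zero
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g  # G_t = r_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns


returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
print(returns)  # [1.5, 1.0, 2.0]
```

Averaging such returns over many episodes that start in $s_t$ would give a Monte Carlo estimate of $v_{\pi}(s_t)$.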

## Algorithm

We aim to estimate the state value of a given policy $\pi$.

Consider this formula:

$g(v_{\pi}(s_t)) = v_{\pi}(s_t) - \mathbb{E}[r_{t+1} + \gamma v_{\pi}(s_{t+1})|s_t]$

We can use the Robbins-Monro algorithm to solve $g(v_{\pi}(s_t)) = 0$.
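
To make the Robbins-Monro step concrete, here is a minimal sketch for the simplest root-finding problem of this shape, $g(w) = w - \mathbb{E}[X] = 0$, where only noisy observations $w - x_k$ are available. The random variable $X$ stands in for $r_{t+1} + \gamma v_{\pi}(s_{t+1})$; the Gaussian samples and the step size $a_k = 1/k$ are assumptions chosen for illustration.

```python
import random


def robbins_monro_mean(samples, w0=0.0):
    """Solve g(w) = w - E[X] = 0 from noisy observations of g."""
    w = w0
    for k, x in enumerate(samples, start=1):
        a_k = 1.0 / k          # step size: sum a_k diverges, sum a_k^2 converges
        noisy_g = w - x        # noisy observation of g(w)
        w = w - a_k * noisy_g  # Robbins-Monro update
    return w


rng = random.Random(0)
samples = [rng.gauss(5.0, 1.0) for _ in range(10000)]
w_hat = robbins_monro_mean(samples)
sample_mean = sum(samples) / len(samples)
print(w_hat, sample_mean)
```

With $a_k = 1/k$ the iteration reduces to the incremental sample mean, so `w_hat` matches `sample_mean` up to floating-point rounding.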

This gives the TD learning algorithm for state values:

$v_{t+1}(s_{t}) = v_{t}(s_t) - \alpha_t(s_t)[v_t(s_t) - [r_{t+1} + \gamma v_t(s_{t+1})]]$

This algorithm is driven by the temporal-difference (TD) error $v_t(s_t) - \bar{v}_t$ and moves the estimate $v_t(s_t)$ toward the TD target $\bar{v}_t = r_{t+1} + \gamma v_t(s_{t+1})$. Note that $\bar{v}_t$ is not the mean of $v_t$; it is a sample-based stand-in for the ideal value $v_{\pi}(s_t)$.
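
The update can be sketched on a tiny hand-made chain. Everything about this example is an assumption for illustration and does not use `make_gridworld`: states 0 and 1 are followed by a terminal state 2, the given policy always moves right, each step yields a reward of 1 plus uniform noise, and the step size is $\alpha_t(s_t) = 1/n(s_t)$, where $n(s_t)$ counts visits to $s_t$.

```python
import random


def td0_state_values(num_episodes=5000, gamma=0.9, seed=0):
    """TD(0) evaluation of a fixed policy on a 3-state chain (state 2 is terminal)."""
    rng = random.Random(seed)
    terminal = 2
    v = [0.0, 0.0, 0.0]  # v[terminal] is never updated and stays 0
    visits = [0, 0, 0]
    for _ in range(num_episodes):
        s = 0
        while s != terminal:
            r = 1.0 + rng.uniform(-0.5, 0.5)  # noisy reward with mean 1
            s_next = s + 1                    # the policy always moves right
            visits[s] += 1
            alpha = 1.0 / visits[s]           # alpha_t(s_t)
            td_target = r + gamma * v[s_next]
            td_error = v[s] - td_target
            v[s] -= alpha * td_error          # only the visited state changes
            s = s_next
    return v


values = td0_state_values()
print(values)
```

For this chain the true values are $v_{\pi}(1) = 1$ and $v_{\pi}(0) = 1 + 0.9 \cdot 1 = 1.9$, so the printed estimates should land close to those numbers.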


## Properties

- It **only** estimates the state value of a given policy.
- It does not estimate the action value.
- It does not search for optimal policies.

# TD learning of action value: Sarsa

## Action Value

$q_{\pi}(s_t, a_t) = \mathbb{E}[G_t|S_t=s_t, A_t=a_t]$

or

$q_{\pi}(s_t, a_t) = \mathbb{E}[r_{t+1} + \gamma v_{\pi}(s_{t+1})|S_t=s_t, A_t=a_t]$

## Algorithm

Suppose we have some experience/data $\{(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})\}$.

We can use the following formula to estimate the action value:

$q_{t+1}(s_t, a_t) = q_{t}(s_t, a_t) - \alpha_t(s_t, a_t)[q_t(s_t, a_t) - [r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})]]$

> The above update solves $g(q_{\pi}(s_t, a_t)) = 0$,
>
> where $g(q_{\pi}(s_t, a_t)) = q_{\pi}(s_t, a_t) - \mathbb{E}[r_{t+1} + \gamma q_{\pi}(s_{t+1}, a_{t+1})|s_t, a_t]$
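
A minimal sketch of this update, used here only to evaluate a fixed policy, is shown below. The toy MDP is an assumption for illustration (it is not the gridworld from `envs`): states 0 and 1 precede a terminal state 2, `MOVE` advances one state with reward 1, `STAY` keeps the state with reward 0, the policy picks either action with probability 0.5, and the step size is a constant $\alpha = 0.01$.

```python
import random

GAMMA = 0.9
TERMINAL = 2
STAY, MOVE = 0, 1


def step(s, a):
    """Hypothetical dynamics: MOVE advances with reward 1, STAY loops with reward 0."""
    return (s + 1, 1.0) if a == MOVE else (s, 0.0)


def policy(rng):
    """The given policy: uniform over the two actions in every state."""
    return rng.choice((STAY, MOVE))


def sarsa_evaluate(num_episodes=20000, alpha=0.01, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(TERMINAL)]  # q[s][a] for non-terminal states
    for _ in range(num_episodes):
        s, a = 0, policy(rng)
        while s != TERMINAL:
            s_next, r = step(s, a)
            if s_next == TERMINAL:
                a_next, q_next = None, 0.0     # no bootstrapping past the end
            else:
                a_next = policy(rng)           # a_{t+1} drawn from the same policy
                q_next = q[s_next][a_next]
            td_target = r + GAMMA * q_next
            q[s][a] -= alpha * (q[s][a] - td_target)
            s, a = s_next, a_next
    return q


q = sarsa_evaluate()
print(q)
```

Solving the Bellman equation by hand for this MDP gives $q_{\pi}(1, \text{stay}) = 9/11$, $q_{\pi}(1, \text{move}) = 1$, $q_{\pi}(0, \text{move}) = 20/11$ and $q_{\pi}(0, \text{stay}) = 180/121$, so the estimates should fluctuate around those values.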

This formula is used to solve the Bellman equation in terms of action values:

$$
\begin{align*}
q_{\pi}(s, a) &= \sum_r rp(r|s, a) + \gamma\sum_{s'} \sum_{a'}q_{\pi}(s', a')p(s'|s, a)\pi(a'|s')\\
&= \sum_r rp(r|s, a) + \gamma \sum_{s'} p(s'|s, a) \sum_{a'}q_{\pi}(s', a')\pi(a'|s')\\
&= \sum_r rp(r|s, a) + \gamma \sum_{s'} \sum_{a'} q_{\pi}(s', a') p(s', a'|s, a)\\
&= \mathbb{E}[r + \gamma q_{\pi}(s', a')|s, a]\\
\end{align*}
$$

where the joint probability factorizes because the policy selects $a'$ based on $s'$ alone:
$$
\begin{align*}
p(s', a'|s, a)
&= p(s'|s, a)p(a'|s', s, a)\\
&= p(s'|s, a)p(a'|s')\\
&= p(s'|s, a)\pi(a'|s')\\
\end{align*}
$$
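
When the model is known, the Bellman equation for action values can be checked numerically by applying its right-hand side repeatedly until the values stop changing. The sketch below does this for a hypothetical two-state MDP, an assumption for illustration only: action 0 stays in place with reward 0, action 1 moves to the next state with reward 1, state 2 is terminal, and $\pi$ is uniform.

```python
GAMMA = 0.9
STATES, ACTIONS, TERMINAL = (0, 1), (0, 1), 2

# Hypothetical model: sum_r r p(r|s, a), p(s'|s, a) and pi(a'|s').
expected_reward = {(s, a): float(a) for s in STATES for a in ACTIONS}
p_next = {(s, a): {s + a: 1.0} for s in STATES for a in ACTIONS}
pi = {s: {a: 0.5 for a in ACTIONS} for s in STATES}


def bellman_backup(q):
    """Apply the right-hand side of the Bellman equation once to every (s, a)."""
    new_q = {}
    for (s, a), r in expected_reward.items():
        future = 0.0
        for s_next, p in p_next[(s, a)].items():
            if s_next == TERMINAL:
                continue  # terminal states contribute zero value
            # p(s', a'|s, a) = p(s'|s, a) * pi(a'|s')
            future += p * sum(pi[s_next][a2] * q[(s_next, a2)] for a2 in ACTIONS)
        new_q[(s, a)] = r + GAMMA * future
    return new_q


q_pi = {key: 0.0 for key in expected_reward}
for _ in range(500):
    q_pi = bellman_backup(q_pi)
print(q_pi)
```

Because the backup is a contraction for $\gamma < 1$, the iterates converge to the unique fixed point, which for this model is $q_{\pi}(0, 0) = 180/121$, $q_{\pi}(0, 1) = 20/11$, $q_{\pi}(1, 0) = 9/11$ and $q_{\pi}(1, 1) = 1$.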