# Temporal Difference Learning methods for Control

> This is a summary of the lecture "Sample-based Learning Methods" from Coursera.

- toc: true 
- badges: true
- comments: true
- author: Chanseok Kang
- categories: [Python, Coursera, Reinforcement_Learning]
- image: 

## SARSA - GPI with TD

### Generalized Policy Iteration

![gpi](image/gpi.png)

### SARSA

The acronym describes the data used in the updates.

$$ S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1} $$

$$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big(R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\big) $$

Recall the TD update,

$$ V(S_t) \leftarrow V(S_t) + \alpha \big( R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big) $$

SARSA applies this TD update to action values $Q(S_t, A_t)$ instead of state values $V(S_t)$. As a result, **SARSA** is the GPI algorithm that uses TD for policy evaluation.
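The SARSA update can be sketched in a few lines of Python. This is a minimal illustration, not a full agent: the function name `sarsa_update`, the dictionary-based table `Q`, and the state and action labels are all assumptions made for the example.

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=1.0):
    """Apply one SARSA update to Q in place and return the TD error."""
    td_target = r + gamma * Q[(s_next, a_next)]   # R_{t+1} + gamma * Q(S_{t+1}, A_{t+1})
    td_error = td_target - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error

Q = defaultdict(float)        # every action value starts at 0
Q[("s1", "right")] = 2.0      # hypothetical estimate for the next state-action pair

# One transition (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}) with made-up values
delta = sarsa_update(Q, "s0", "right", r=1.0, s_next="s1", a_next="right")
# TD error is 1.0 + 1.0 * 2.0 - 0.0 = 3.0, so Q[("s0", "right")] becomes 1.5
```

Note that the update needs $A_{t+1}$, so the next action has to be chosen before $Q(S_t, A_t)$ can be updated.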

## What is Q-learning?

### Q-learning (off-policy TD Control) for estimating $\pi \approx \pi_{*}$

$\begin{aligned}
&\text{Algorithm parameters: step size } \alpha \in (0, 1], \text{ small } \epsilon > 0 \\
&\text{Initialize } Q(s, a), \text{ for all } s \in \mathcal{S}^{+}, a \in \mathcal{A}(s), \text{ arbitrarily except that } Q(terminal, \cdot) = 0 \\
\newline
&\text{Loop for each episode: } \\
&\quad \text{Initialize } S\\
&\quad \text{Loop for each step of episode:} \\
&\qquad \text{Choose } A \text{ from } S \text{ using policy derived from } Q \text{ (e.g., } \epsilon-\text{greedy)} \\
&\qquad \text{Take action } A, \text{ observe } R, S' \\
&\qquad Q(S, A) \leftarrow Q(S, A) + \alpha [R + \gamma \max_{a} Q(S', a) - Q(S, A)] \\
&\qquad S \leftarrow S' \\
&\quad \text{until } S \text{ is terminal} \\
\end{aligned}$
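The pseudocode above can be turned into a small runnable sketch. The environment here is an assumption made purely for illustration (a deterministic 5-state chain with a reward of -1 per step, not an example from the lecture), and the names `step`, `epsilon_greedy`, and `q_learning` are hypothetical helpers.

```python
import random

N_STATES = 5                # states 0..4; state 4 is terminal
ACTIONS = (0, 1)            # 0 = left, 1 = right
TERMINAL = N_STATES - 1

def step(state, action):
    """Deterministic chain: reward -1 per step; moving left in state 0 stays put."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    return -1.0, next_state

def epsilon_greedy(Q, state, epsilon, rng):
    """Behavior policy derived from Q."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[state][a])

def q_learning(episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q(terminal, .) is never updated, so it stays 0
    for _ in range(episodes):
        s = 0                                    # initialize S
        while s != TERMINAL:
            a = epsilon_greedy(Q, s, epsilon, rng)
            r, s_next = step(s, a)
            # The target uses the max over next actions, not the action taken next
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next
    return Q

Q = q_learning()
greedy_policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(TERMINAL)]
```

With these settings the greedy policy read off from `Q` should move right in every non-terminal state, since that is the shortest path to the terminal state.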

### Revisiting Bellman equations

- SARSA: SARSA is a sample-based version of policy iteration, and it uses the standard Bellman equation for action values.

$$ q_{\pi}(s, a) = \sum_{s', r} p(s', r \vert s, a) \big(r + \gamma \sum_{a'}\pi(a' \vert s') q_{\pi}(s', a')\big) $$

$$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big(R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\big) $$

- Q-learning: Q-learning is a sample-based version of value iteration, and it uses the Bellman optimality equation for action values.

$$ q_{*}(s, a) = \sum_{s', r} p(s', r \vert s, a) \big(r + \gamma \max_{a'} q_{*}(s', a') \big) $$

$$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big( R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)\big) $$
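A small numeric comparison may help make the two targets concrete. All numbers below (the reward, the next-state estimates `q_next`, and the policy probabilities `pi_next`) are made up for illustration.

```python
gamma = 0.9
reward = 1.0

# Hypothetical estimates Q(S_{t+1}, a') for the two actions in the next state
q_next = {"left": 2.0, "right": 5.0}
# Hypothetical pi(a' | S_{t+1}), e.g. epsilon-greedy with epsilon = 0.2
pi_next = {"left": 0.1, "right": 0.9}
# Assume the policy happened to sample the exploratory action as A_{t+1}
sampled_next_action = "left"

# SARSA: one sample of the inner sum over a' in the Bellman equation
sarsa_target = reward + gamma * q_next[sampled_next_action]

# Q-learning: the max from the Bellman optimality equation
q_learning_target = reward + gamma * max(q_next.values())

# For reference: the full inner sum that the SARSA target estimates
bellman_inner_sum = reward + gamma * sum(pi_next[a] * q_next[a] for a in q_next)
```

Here the SARSA target is about 2.8, the Q-learning target about 5.5, and the full expectation under $\pi$ about 5.23. The SARSA target changes with the sampled $A_{t+1}$, while the Q-learning target does not.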

## How is Q-learning off-policy?

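SARSA is **on-policy**: its update target uses $A_{t+1}$, the action actually selected by the current policy ($\pi$), so the target policy and the behavior policy are the same. Q-learning is **off-policy**: the agent still behaves according to an exploratory policy such as $\epsilon$-greedy, but the update target uses $\max_{a'} Q(S_{t+1}, a')$, which corresponds to the greedy policy with respect to $Q$, that is, the current estimate of the optimal policy ($\pi_{*}$). The target policy is therefore not the same as the behavior policy.

Even so, Q-learning does not require any importance sampling. It learns action values, so the action $A_t$ being updated is given rather than sampled, and the expectation over next actions under the target policy is computed directly by the $\max$ instead of being estimated from the behavior policy's samples.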
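The claim about importance sampling can be checked with a short sketch. It assumes a greedy target policy over hypothetical next-state values; `greedy_probs` and `q_learning_target` are illustrative names, not part of any library.

```python
def greedy_probs(q_values):
    """Greedy target policy: probability 1 on the highest-valued action (ties go to the first)."""
    best = max(q_values, key=q_values.get)
    return {a: (1.0 if a == best else 0.0) for a in q_values}

def q_learning_target(reward, q_next, gamma):
    """No next-action argument: the behavior policy's next choice is not needed."""
    return reward + gamma * max(q_next.values())

q_next = {"left": 2.0, "right": 5.0}    # hypothetical Q(S_{t+1}, .)
target_pi = greedy_probs(q_next)

# Expectation of Q(S_{t+1}, a') under the greedy target policy, computed exactly
expected_under_target = sum(target_pi[a] * q_next[a] for a in q_next)

# It coincides with the max used in the Q-learning update
assert expected_under_target == max(q_next.values())
```

Because the expectation under the target policy is available in closed form, there is no sampled next action whose probability would need to be corrected with an importance sampling ratio.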