## Model-Free Prediction

### Monte-Carlo Learning

Incremental Monte-Carlo Updates

- Update $V(s)$ incrementally after each episode $S_1, A_1, R_2, \dots, S_T$
- For each state $S_t$ with return $G_t$

\begin{align}
N(S_t) & \leftarrow N(S_t) +1 \\
V(S_t) & \leftarrow V(S_t) + \frac{1}{N(S_t)}(G_t - V(S_t))
\end{align}

- In non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes.

$$
V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t))
$$
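The two update rules above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `mc_update` and the input format (a list of `(state, return)` pairs for one episode) are assumptions made for the example.

```python
from collections import defaultdict


def mc_update(V, N, episode_returns, alpha=None):
    """Apply incremental Monte-Carlo updates for one episode.

    episode_returns: list of (state, G_t) pairs (hypothetical input format).
    alpha=None uses the 1/N(S_t) step size (exact running mean);
    a float uses the constant step size, which forgets old episodes.
    """
    for state, g in episode_returns:
        N[state] += 1
        step = 1.0 / N[state] if alpha is None else alpha
        V[state] += step * (g - V[state])
    return V


# Running mean: two episodes visiting "s" with returns 2 and 4.
V = defaultdict(float)
N = defaultdict(int)
mc_update(V, N, [("s", 2.0)])
mc_update(V, N, [("s", 4.0)])

# Constant step size: one episode with return 2 and alpha = 0.5.
V_alpha = defaultdict(float)
mc_update(V_alpha, defaultdict(int), [("s", 2.0)], alpha=0.5)
```

With the $1/N(S_t)$ step size, `V["s"]` equals the mean of the observed returns; with a constant $\alpha$, recent returns carry more weight.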


### Temporal-Difference Learning


Simple temporal-difference learning algorithm: TD(0)
- Update value $V(S_t)$ toward the estimated return $R_{t+1} + \gamma V(S_{t+1})$
$$
V(S_t) \leftarrow V(S_t) + \alpha (R_{t+1} + \gamma V(S_{t+1}) - V(S_t))
$$
- $R_{t+1} + \gamma V(S_{t+1})$ is called the TD target
- $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error
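A single TD(0) step might look like the sketch below. The function name `td0_update`, the dictionary-based value table, and the `terminal` flag (which treats the value of a terminal state as 0) are assumptions added for illustration.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """Perform one TD(0) update on V[s] and return the TD error."""
    # TD target: R_{t+1} + gamma * V(S_{t+1}); a terminal state contributes 0.
    target = r + (0.0 if terminal else gamma * V[s_next])
    # TD error: delta_t = target - V(S_t)
    delta = target - V[s]
    V[s] += alpha * delta
    return delta


V = {"A": 0.0, "B": 1.0}
delta = td0_update(V, "A", r=0.5, s_next="B", alpha=0.5, gamma=0.5)
```

Unlike the Monte-Carlo update, this step needs only one transition $(S_t, R_{t+1}, S_{t+1})$, so it can be applied before the episode ends.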

## Model-Free Control

#### On-policy learning
- "Learn on the job"
- Learn about policy $\pi$ from experience sampled from $\pi$

#### Off-policy learning
- "Look over someone's shoulder"
- Learn about policy $\pi$ from experience sampled from $\mu$

### Generalised Policy Iteration With Monte-Carlo Evaluation

- Policy evaluation: Monte-Carlo policy evaluation, $V = v_\pi$ ?
- Policy improvement: Greedy policy improvement?
- Greedy policy improvement over $V(s)$ requires model of MDP
$$
\DeclareMathOperator*{\argmax}{arg\,max}
\pi^\prime(s) = \argmax\limits_{a \in A} \left( R^a_s + \gamma \sum_{s^\prime} P^a_{s s^\prime} V(s^\prime) \right)
$$
- Monte-Carlo policy evaluation is possible without a model, but greedy policy improvement over $V(s)$ is not, because it requires knowing the MDP (both $R^a_s$ and $P^a_{s s^\prime}$ must be known).
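To make the model dependence concrete, here is a hedged sketch of greedy improvement over $V(s)$. It assumes the full expected backup (discount `gamma` and a sum over successor states), and the tables `R` and `P` are hypothetical stand-ins for the reward and transition model that a model-free agent does not have.

```python
def greedy_from_v(state, actions, R, P, V, gamma=1.0):
    """Pick the greedy action using a one-step lookahead through the model.

    R[(s, a)]: expected immediate reward (part of the model).
    P[(s, a)]: dict mapping next state -> transition probability (part of the model).
    """
    def backup(a):
        expected_next = sum(p * V[s2] for s2, p in P[(state, a)].items())
        return R[(state, a)] + gamma * expected_next

    return max(actions, key=backup)


# Toy model: two actions leading deterministically to states of different value.
R = {("s", "left"): 0.0, ("s", "right"): 0.0}
P = {("s", "left"): {"x": 1.0}, ("s", "right"): {"y": 1.0}}
V = {"x": 1.0, "y": 2.0}
best = greedy_from_v("s", ["left", "right"], R, P, V)
```

Without `R` and `P` the lookahead cannot be computed, which is why the notes move to action values next.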

### Model-Free Policy Improvement Using Action-Value Function

- Greedy policy improvement over $Q(s,a)$ is model-free
$$
\DeclareMathOperator*{\argmax}{arg\,max}
\pi^\prime (s) = \argmax\limits_{a \in A} Q(s,a)
$$
- Greedy policy improvement using $Q(s,a)$ is therefore possible without a model.
- However, purely greedy improvement over $Q(s,a)$ never gathers information about untried actions that might yield a higher reward, so learning can get stuck.
- Therefore, $\epsilon$-greedy exploration should be applied in the policy improvement step.

### $\epsilon$-Greedy Exploration
- With probability $1-\epsilon$ choose the greedy action
- With probability $\epsilon$ choose an action at random
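The selection rule above can be sketched as follows. The function name `epsilon_greedy`, the `(state, action)`-keyed table `Q`, and the optional `rng` argument are assumptions for this example.

```python
import random


def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """Return a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        # Explore: uniform over all actions (this may also pick the greedy one).
        return rng.choice(actions)
    # Exploit: action with the highest estimated value in this state.
    return max(actions, key=lambda a: Q[(state, a)])


Q = {("s", "left"): 0.2, ("s", "right"): 0.7}
actions = ["left", "right"]
greedy_action = epsilon_greedy(Q, "s", actions, epsilon=0.0)
explore_action = epsilon_greedy(Q, "s", actions, epsilon=1.0)
```

Because the random draw is uniform over all $m$ actions, the greedy action is chosen with total probability $1 - \epsilon + \epsilon/m$ and every other action with probability $\epsilon/m$.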

### Monte-Carlo Policy Iteration
- Policy evaluation: Monte-Carlo policy evaluation, $Q=q_\pi$
- Policy improvement: $\epsilon$-greedy policy improvement

### Monte-Carlo Control
- Every episode:
- Policy evaluation: Monte-Carlo policy evaluation, $Q \approx q_\pi$
- Policy improvement: $\epsilon$-greedy policy improvement
- For convergence, $\epsilon$ must decay toward 0 as the episodes progress (GLIE, Greedy in the Limit with Infinite Exploration), for example with the schedule
$$
\epsilon_k = \frac{1}{k}
$$

### GLIE Monte-Carlo Control
- Every episode:
- Policy evaluation: Monte-Carlo policy evaluation, $Q \approx q_\pi$
\begin{align}
N(S_t, A_t) & \leftarrow N(S_t, A_t) + 1\\
Q(S_t, A_t) & \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)}(G_t - Q(S_t,A_t))
\end{align}
- The step size $\frac{1}{N(S_t, A_t)}$ can be replaced with a constant $\alpha$
- Policy improvement: $\epsilon$-greedy policy improvement
\begin{align}
\epsilon & \leftarrow 1 / k \\
\pi & \leftarrow \epsilon\text{-greedy}(Q)
\end{align}
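One episode of the evaluation and improvement steps above might be sketched like this. Several details are assumptions made for the example: the function name `glie_mc_update`, the episode format (a list of `(state, action, reward)` triples where the reward is $R_{t+1}$), and the use of every-visit updates.

```python
from collections import defaultdict


def glie_mc_update(Q, N, episode, k, gamma=1.0):
    """Update Q from the k-th episode and return the next epsilon (1/k).

    episode: list of (S_t, A_t, R_{t+1}) triples (hypothetical format).
    Every-visit updates are assumed here.
    """
    g = 0.0
    # Walk backwards so the return G_t accumulates as R_{t+1} + gamma * G_{t+1}.
    for state, action, reward in reversed(episode):
        g = reward + gamma * g
        N[(state, action)] += 1
        Q[(state, action)] += (g - Q[(state, action)]) / N[(state, action)]
    # Policy improvement: the next policy is epsilon-greedy in Q with epsilon = 1/k.
    return 1.0 / k


Q = defaultdict(float)
N = defaultdict(int)
episode = [("s0", "a", 0.0), ("s1", "b", 2.0)]
epsilon = glie_mc_update(Q, N, episode, k=4, gamma=0.5)
```

The returned `epsilon` would then be passed to whatever $\epsilon$-greedy action selector generates the next episode.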


### Updating Action-Value Function with Sarsa
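- On-policy TD control with GLIE $\epsilon$-greedy policy improvement is Sarsa, named after the tuple $(S, A, R, S^\prime, A^\prime)$ used in each update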
$$
Q(S,A) \leftarrow Q(S,A) + \alpha \left( R + \gamma Q(S^\prime, A^\prime) - Q(S,A) \right)
$$
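A single Sarsa step can be sketched as below. The function name `sarsa_update` and the `(state, action)`-keyed table are assumptions for illustration; `a_next` is the action actually selected in $S^\prime$ by the current behaviour policy (for example $\epsilon$-greedy), which is what makes the update on-policy.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0):
    """Update Q(S, A) toward R + gamma * Q(S', A')."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]


Q = defaultdict(float)
Q[("s1", "left")] = 1.0
new_value = sarsa_update(Q, "s0", "left", r=1.0, s_next="s1", a_next="left",
                         alpha=0.5, gamma=0.5)
```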

### Q-Learning Control Algorithm

$$
Q(S,A) \leftarrow Q(S,A) + \alpha \left( R + \gamma \max\limits_{a^\prime} Q(S^\prime, a^\prime) - Q(S,A) \right)
$$
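The Q-learning step differs from Sarsa only in the target, which maximises over the actions available in $S^\prime$ instead of using the action actually taken. The sketch below assumes a hypothetical function name `q_learning_update`, a `(state, action)`-keyed table, and that the same action list is valid in every state.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=1.0):
    """Update Q(S, A) toward R + gamma * max_a' Q(S', a')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]


Q = defaultdict(float)
Q[("s1", "left")] = 1.0
Q[("s1", "right")] = 3.0
new_value = q_learning_update(Q, "s0", "left", r=1.0, s_next="s1",
                              actions=["left", "right"], alpha=0.5, gamma=0.5)
```

Because the target does not depend on the action the behaviour policy takes next, the update learns about the greedy policy while following a different one, which is the off-policy setting described earlier.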