# Chapter 6: Temporal-Difference Learning

## 1. TD Prediction
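- TD learning combines ideas from Monte Carlo (MC) and dynamic programming (DP)
    - like MC, it learns directly from experience, without a model of the environment
    - like DP, it bootstraps: estimates are updated from other learned estimates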
- *constant-α MC*, a simple every-visit MC method:
    - must wait until the end of the episode to obtain the return $G_t$
$$V(S_t) \gets V(S_t) + \alpha\big[G_t-V(S_t)\big]$$
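- **one-step TD**, or *TD(0)*:
    - bootstraps: the target $R_{t+1}+\gamma V(S_{t+1})$ uses the current estimate for the next state, so the update can be made right after the transition
$$V(S_t) \gets V(S_t) + \alpha\big[R_{t+1}+\gamma V(S_{t+1})-V(S_t)\big]$$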

![TD 0](assets/6.1.td0.png)
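The TD(0) update can be sketched in a few lines of Python. The environment is an assumption made for illustration: a five-state random walk (states 1 to 5, terminal states 0 and 6, reward 1 only when exiting on the right). The function name `td0_random_walk` and its parameters are hypothetical, not taken from these notes.

```python
import random

def td0_random_walk(num_episodes=1000, alpha=0.05, gamma=1.0, seed=0):
    """Tabular TD(0) prediction for the equiprobable random policy."""
    rng = random.Random(seed)
    V = [0.0] * 7              # V[0] and V[6] are terminal and stay at 0
    for s in range(1, 6):
        V[s] = 0.5             # arbitrary initial guess for nonterminal states
    for _ in range(num_episodes):
        s = 3                  # every episode starts in the middle state
        while s not in (0, 6):
            s_next = s + rng.choice((-1, 1))
            r = 1.0 if s_next == 6 else 0.0
            # TD(0): move V(S_t) toward the bootstrapped target R + gamma * V(S_{t+1})
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V

V = td0_random_walk()
```

For this walk with $\gamma=1$ the true values of states 1 to 5 are $1/6, 2/6, \dots, 5/6$, so the learned `V[1..5]` should end up close to those numbers, with some residual noise because $\alpha$ is constant.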


- TD *error*:
$$\delta_t = R_{t+1}+\gamma V(S_{t+1})-V(S_t)$$

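    if the array $V$ does not change during the episode (as in MC methods), then the MC error can be written as a sum of TD errors; if $V$ is updated during the episode (as in TD(0)), the identity is not exact but holds approximately when the step size is small: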
    $$G_t-V(S_t) = \sum_{k=t}^{T-1}\gamma^{k-t}\delta_k$$
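The identity can be checked numerically for $t=0$. This is a small sketch with made-up rewards and values; `mc_error_and_td_sum` is a hypothetical helper, and the value list is held fixed for the whole episode as the identity requires.

```python
def mc_error_and_td_sum(rewards, values, gamma):
    """rewards[k] = R_{k+1}; values[k] = V(S_k) for k = 0..T, with V(S_T) = 0."""
    T = len(rewards)
    # Return from time 0, computed backwards: G_k = R_{k+1} + gamma * G_{k+1}
    G = 0.0
    for k in reversed(range(T)):
        G = rewards[k] + gamma * G
    mc_error = G - values[0]
    # TD errors under a value table that does not change during the episode
    deltas = [rewards[k] + gamma * values[k + 1] - values[k] for k in range(T)]
    td_sum = sum(gamma ** k * delta for k, delta in enumerate(deltas))
    return mc_error, td_sum

mc_error, td_sum = mc_error_and_td_sum(
    rewards=[0.0, 2.0, -1.0], values=[0.5, 1.0, -0.3, 0.0], gamma=0.9)
```

Both numbers come out as 0.49 up to floating-point rounding: the bootstrapped terms $\gamma V(S_{k+1})$ cancel pairwise in the discounted sum.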
    
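- Advantages of TD prediction
    - does not require a model of the environment (unlike DP), only experience
    - can be implemented in an online, fully incremental fashion
        - waits only one time step, whereas MC must wait until the end of the episode
    - for any fixed policy $\pi$, TD(0) converges to $v_\pi$ (in the mean for a sufficiently small constant step size, and with probability 1 if the step size decreases appropriately)
    - in practice, TD usually converges faster than constant-α MC on stochastic tasks
        - whether one method can be proven to converge faster than the other is still an open question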

## 2. Optimality of TD(0)
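- **Batch updating**: the value function is updated only after processing each complete *batch* of training data
    - increments are computed by TD or MC for every time step at which a state is visited
    - the value function is changed only once, by the sum of all the increments; the batch is then processed again with the new value function until it converges
- Under batch updating, TD(0) converges deterministically to a single answer, independent of $\alpha$, as long as $\alpha$ is sufficiently small
- Constant-α MC also converges deterministically under the same conditions, but to a *different answer*
- Batch MC fits the existing (past) data better, while batch TD(0) performs better on future data
    - batch MC finds the estimates that minimize the *mean-squared error* on the training set
    - batch TD(0) finds the estimates that would be exactly correct for the *maximum-likelihood* model of the Markov process
        - this is the **certainty-equivalence estimate**: the value function that would be exactly correct if the estimated model were exactly correct
- In batch form, TD(0) is faster than MC because it computes the true certainty-equivalence estimate
- Nonbatch TD(0) does not reach the certainty-equivalence estimate exactly, but it approximates the same solution with far less memory and computation than building the maximum-likelihood model and solving it directly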
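The difference between the two batch answers shows up on a tiny data set. The sketch below assumes eight hand-made episodes over two states `A` and `B` (one episode `A, 0, B, 0`, six episodes `B, 1`, one episode `B, 0`); the data layout, the state names, and the helper `batch_values` are illustrative assumptions.

```python
def batch_values(episodes, method, alpha=0.01, gamma=1.0, sweeps=5000):
    """Batch updating with TD(0) targets (method="td") or MC targets (method="mc").

    Each episode is a list of (state, reward, next_state) transitions.
    """
    V = {"A": 0.0, "B": 0.0, "end": 0.0}   # "end" is terminal and stays at 0
    for _ in range(sweeps):
        increments = {s: 0.0 for s in V}
        for episode in episodes:
            # Returns G_t for every step, computed backwards through the episode
            G, returns = 0.0, []
            for transition in reversed(episode):
                G = transition[1] + gamma * G
                returns.append(G)
            returns.reverse()
            for (s, r, s_next), G in zip(episode, returns):
                target = G if method == "mc" else r + gamma * V[s_next]
                increments[s] += alpha * (target - V[s])
        # Apply the summed increments only after the whole batch was processed
        for s in V:
            V[s] += increments[s]
    return V

episodes = (
    [[("A", 0.0, "B"), ("B", 0.0, "end")]]
    + [[("B", 1.0, "end")]] * 6
    + [[("B", 0.0, "end")]]
)
v_td = batch_values(episodes, "td")
v_mc = batch_values(episodes, "mc")
```

Both methods agree on $V(B)=0.75$. Batch MC gives $V(A)=0$, the only return ever observed from `A`, which minimizes squared error on the data. Batch TD(0) gives $V(A)\approx 0.75$, the value implied by the maximum-likelihood model in which `A` always leads to `B` with reward 0.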

## 3. Sarsa: On-policy TD Control
- Learn an action-value function $q_\pi(s, a)$ for the current behavior policy
![state-action pairs transition](assets/6.4.state-action-pairs.png)

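- TD(0) update for action values, applied after every transition from a nonterminal state $S_t$ (with $Q(S_{t+1},A_{t+1})=0$ if $S_{t+1}$ is terminal):
$$Q(S_t,A_t)\gets Q(S_t,A_t)+\alpha\big[R_{t+1}+\gamma Q(S_{t+1},A_{t+1})-Q(S_t,A_t)\big]$$
    - the name *Sarsa* comes from the quintuple of events used by the update: $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$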
  
- Continually estimate $q_\pi$ for the behavior policy $\pi$, and change $\pi$ toward greediness with respect to $q_\pi$ at the same time

![sarsa](assets/6.4.sarsa.png)
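A minimal Sarsa sketch in Python follows. The environment is an assumption for illustration: a deterministic corridor with states 0 to 4, a wall on the left, a terminal goal at state 4, and a reward of -1 per move. The names `step`, `epsilon_greedy`, and `sarsa` are hypothetical.

```python
import random

GOAL = 4           # terminal state of the hypothetical corridor 0..4
ACTIONS = (0, 1)   # 0 = left, 1 = right

def step(state, action):
    """Deterministic corridor: reward -1 per move, left wall at state 0."""
    next_state = max(0, state + (1 if action == 1 else -1))
    return -1.0, next_state, next_state == GOAL

def epsilon_greedy(Q, state, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[state][a])

def sarsa(num_episodes=2000, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(GOAL + 1)]   # Q[GOAL] stays 0 (terminal)
    for _ in range(num_episodes):
        s = 0
        a = epsilon_greedy(Q, s, epsilon, rng)
        done = False
        while not done:
            r, s_next, done = step(s, a)
            # On-policy: A_{t+1} comes from the same policy that is being improved
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)
            Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
            s, a = s_next, a_next
    return Q

Q = sarsa()
```

Because the next action is sampled from the ε-greedy policy itself, the learned values describe that exploring policy rather than the greedy one, so they are somewhat lower than the optimal values whenever exploration is costly.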

## 4. Q-learning: Off-policy TD Control
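- The learned action-value function $Q$ directly approximates the optimal action-value function $q_*$, independent of the policy being followed
    - the policy still determines which state-action pairs are visited and updated; convergence requires that all pairs continue to be updated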

$$Q(S_t,A_t)\gets Q(S_t,A_t)+\alpha\big[R_{t+1}+\gamma\max_aQ(S_{t+1},a)-Q(S_t,A_t)\big]$$

![Q-learning](assets/6.5.q-learning.png)
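The same idea as a Python sketch. The corridor environment (states 0 to 4, goal at 4, reward -1 per move) and the names `step` and `q_learning` are assumptions for illustration only.

```python
import random

GOAL = 4           # terminal state of the hypothetical corridor 0..4
ACTIONS = (0, 1)   # 0 = left, 1 = right

def step(state, action):
    """Deterministic corridor: reward -1 per move, left wall at state 0."""
    next_state = max(0, state + (1 if action == 1 else -1))
    return -1.0, next_state, next_state == GOAL

def q_learning(num_episodes=1000, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(GOAL + 1)]   # Q[GOAL] stays 0 (terminal)
    for _ in range(num_episodes):
        s, done = 0, False
        while not done:
            # Behavior policy: epsilon-greedy with respect to the current Q
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda b: Q[s][b])
            r, s_next, done = step(s, a)
            # Off-policy target: greedy value of the next state, whatever is done next
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next
    return Q

Q = q_learning()
```

In this deterministic corridor the optimal value of moving right from state $s$ is $-(4-s)$, and `Q[s][1]` should approach exactly that, even though the agent keeps exploring with $\varepsilon=0.1$.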

## 5. Expected Sarsa
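- Just like Q-learning, except that instead of the maximum over next state-action pairs it uses the expected value, taking into account how likely each action is under the current policy
- Equivalently, instead of Sarsa's sampled value $Q(S_{t+1},A_{t+1})$, use its **expectation** under $\pi$: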

$$
\begin{aligned}
Q(S_t,A_t) &\gets Q(S_t,A_t)+\alpha\Big[R_{t+1}+\gamma E_\pi\big[Q(S_{t+1},A_{t+1}) | S_{t+1}\big]-Q(S_t,A_t)\Big]
\\ &\gets Q(S_t,A_t)+\alpha\Big[R_{t+1}+\gamma \sum_a\pi(a | S_{t+1})Q(S_{t+1},a)-Q(S_t,A_t)\Big]
\end{aligned}
$$

- Expected Sarsa is computationally more complex than Sarsa
- but it performs better than Sarsa given the same amount of experience, because it eliminates the variance due to the random selection of $A_{t+1}$
- It can also be used off-policy, with a behavior policy different from the target policy $\pi$
    - Q-learning is the special case in which $\pi$ is the greedy policy
- In this sense Expected Sarsa subsumes and generalizes Q-learning while reliably improving over Sarsa
- Apart from the small additional computational cost, it may completely dominate both of the other, better-known TD control algorithms
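A sketch of a single Expected Sarsa update, assuming the target policy $\pi$ is ε-greedy with respect to the current table. The helper names and the tie-breaking rule (all greedy probability on the first maximizing action) are illustrative assumptions.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """pi(a | s) for an epsilon-greedy policy; ties go to the first maximizing action."""
    n = len(q_values)
    probs = [epsilon / n] * n
    greedy = max(range(n), key=lambda a: q_values[a])
    probs[greedy] += 1.0 - epsilon
    return probs

def expected_sarsa_update(Q, s, a, r, s_next, done, alpha, gamma, epsilon):
    """One Expected Sarsa step on a table Q[state] -> list of action values."""
    if done:
        expected_next = 0.0   # terminal states have value 0
    else:
        probs = epsilon_greedy_probs(Q[s_next], epsilon)
        # sum_a pi(a | S_{t+1}) * Q(S_{t+1}, a)
        expected_next = sum(p * q for p, q in zip(probs, Q[s_next]))
    Q[s][a] += alpha * (r + gamma * expected_next - Q[s][a])
    return Q[s][a]

Q = {"s0": [0.0, 0.0], "s1": [1.0, 3.0]}
new_value = expected_sarsa_update(Q, "s0", 0, r=1.0, s_next="s1", done=False,
                                  alpha=0.5, gamma=1.0, epsilon=0.2)
```

Here the expectation is $0.1\cdot 1 + 0.9\cdot 3 = 2.8$, so the target is 3.8 and the updated value is 1.9. With `epsilon=0.0` the expectation collapses to $\max_a Q(S_{t+1},a)$ and the update becomes the Q-learning update.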

## 6. Maximization Bias and Double Learning
- **Maximization bias**: using the maximum over estimated values as an estimate of the maximum of the true values can lead to a significant positive bias
- Avoid it with **double learning**
    - Divide the plays into two sets and learn two independent estimates, $Q_1(a)$ and $Q_2(a)$, of the true value $q(a)$
    - Use one (e.g. $Q_1$) to determine the maximizing action: $A^*=\arg\max_a Q_1(a)$
    - Use the other (e.g. $Q_2$) to estimate its value: $Q_2(A^*)=Q_2(\arg\max_a Q_1(a))$
    - This estimate is then unbiased in the sense that $E[Q_2(A^*)]=q(A^*)$
- Note that:
    - two estimates are learned, but only one of them is updated on each play
    - the memory requirements double
    - the amount of computation per step does not increase
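- *Double Q-learning*: on each step, with probability 0.5 apply the update below; otherwise apply the same update with the roles of $Q_1$ and $Q_2$ switched: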
$$Q_1(S_t,A_t)\gets Q_1(S_t,A_t)+\alpha\big[R_{t+1}+\gamma Q_2\big(S_{t+1},\arg\max_aQ_1(S_{t+1}, a)\big)-Q_1(S_t,A_t)\big]$$

![Double Q-learning](assets/6.7.double-q-learning.png)
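One Double Q-learning step can be sketched as follows. The function name `double_q_update`, the dictionary-of-lists table layout, and the injected `rng` argument are assumptions made for illustration.

```python
import random

def double_q_update(Q1, Q2, s, a, r, s_next, done, alpha, gamma, rng=random):
    """Update exactly one of the two tables; return True if Q1 was the one updated."""
    # A fair coin decides which table learns on this step
    if rng.random() < 0.5:
        learner, evaluator = Q1, Q2
    else:
        learner, evaluator = Q2, Q1
    if done:
        target = r
    else:
        # The learner picks the maximizing action, the other table evaluates it
        n_actions = len(learner[s_next])
        best = max(range(n_actions), key=lambda b: learner[s_next][b])
        target = r + gamma * evaluator[s_next][best]
    learner[s][a] += alpha * (target - learner[s][a])
    return learner is Q1

Q1 = {0: [0.0, 0.0], 1: [1.0, 5.0]}
Q2 = {0: [0.0, 0.0], 1: [3.0, 2.0]}
updated_q1 = double_q_update(Q1, Q2, s=0, a=0, r=1.0, s_next=1, done=False,
                             alpha=0.5, gamma=1.0, rng=random.Random(0))
```

If `Q1` is the learner, it selects action 1 in the next state but the target uses `Q2[1][1] = 2.0`, giving a new `Q1[0][0]` of 1.5, where ordinary Q-learning on `Q1` alone would have used its own maximum 5.0. If `Q2` is the learner, it selects action 0 and the target uses `Q1[1][0] = 1.0`. For acting, a common choice is an ε-greedy policy on the sum or average of the two tables.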

## 7. Games, Afterstates, and Other Special Cases
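- **Afterstates**: states evaluated after the agent has made its move (e.g. the board position in a game right after the agent's move)
    - value functions over them are **afterstate value functions**
    - useful when we know an initial part of the environment's dynamics (the immediate effect of our action) but not necessarily the full dynamics
- Different state-action pairs that lead to the same *afterposition* must have the same value, so an afterstate value function evaluates them together: learning about one pair transfers immediately to the others, instead of each being learned separately
- Afterstates can be used in GPI with a policy and an afterstate value function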
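A small tic-tac-toe sketch shows why afterstates save learning effort. The board encoding (a 9-character string with `.` for empty cells) and the helpers `afterstate` and `greedy_move` are hypothetical choices for illustration.

```python
def afterstate(board, move, mark="X"):
    """Board after `mark` is placed on cell `move` (cells are indexed 0..8)."""
    if board[move] != ".":
        raise ValueError("cell already occupied")
    return board[:move] + mark + board[move + 1:]

def greedy_move(board, afterstate_values, default=0.5):
    """Pick the legal move whose afterstate has the highest estimated value."""
    legal = [m for m in range(9) if board[m] == "."]
    return max(legal,
               key=lambda m: afterstate_values.get(afterstate(board, m), default))

# Two different (position, move) pairs that lead to the same afterstate
a1 = afterstate("X...O....", 8)
a2 = afterstate("....O...X", 0)

# A single table entry therefore covers both pairs
values = {a1: 0.9}
move_from_first = greedy_move("X...O....", values)
move_from_second = greedy_move("....O...X", values)
```

Both calls to `afterstate` return `"X...O...X"`, so one learned value is shared. A conventional action-value table would need separate entries for the two position-move pairs and would have to learn each of them independently.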