### Approaches to Solving MDPs III: Temporal Difference

Last updated: March 15, 2022


### SOURCES 

- Reinforcement Learning, RS Sutton & AG Barto, 2nd edition. Chapter 6
- Mastering Reinforcement Learning with Python, Enes Bilgin. Chapter 5

### LEARNING OUTCOMES

- Understand how the temporal-difference method makes updates
- Explain how TD updates are an improvement over MC updates

### CONCEPTS

- Temporal-Difference method
- Temporal-Difference error

---  

### Temporal-Difference (TD) Methods

---  

![mc_vs_td](images/mc_vs_td.png)  
**Illustration Comparing MC Updates to TD Updates**

---

One limitation of MC methods is that we must simulate a full trajectory (a complete episode) before we can update the value estimates.

TD methods also learn from experience, but they can update after each time step by using *bootstrapping*.  

**Revisiting the commute time example**  
If we realize mid-trip that we're running 30 minutes late, we can use this information  
right away to update the total commute time estimate, without waiting until we arrive.


**What do we mean by bootstrapping?**  
Bootstrapping means basing an update on an existing estimate. The initial estimates are typically inaccurate, but they are still useful because each update refines them.


**TD(0)** is a method that learns after one time step  

The method is mathematically sound: for any fixed policy $\pi$, the tabular TD(0) estimate converges to the value function $v_\pi$ under standard conditions on the step size.

Modern RL algorithms typically combine TD methods with function approximation, most often neural networks.

#### TD Prediction

Let's start with the state-value function of policy $\pi$ starting from state *s*:

$v_\pi(s) = E_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t=s]$

Taking a step from *s* under $\pi$ produces an action *a*, a reward *r*, and a next state *s'*.

The observed quantity $r + \gamma v_\pi(s')$ gives a new estimate of $v_\pi(s)$ based on a single sample. In practice $v_\pi(s')$ is unknown, so we plug in our current estimate of it; this is the bootstrapping step.

**Important idea: use this observation to update the existing value estimate by moving it closer to the new observation.** 

We don't want to discard the old estimate, because the new one comes from a sample of size one and may be noisy.


#### Update Rule

We update the estimate of $v_\pi(s)$ after a single step by computing a convex combination of the old estimate and the observed quantity.  
A weight $\alpha\in (0,1]$, the step size (learning rate), mixes the two; a larger value weights the most recent observation more heavily.

We will use $V$ to denote value estimates computed from sample data collected under policy $\pi$.

$V(s) := (1-\alpha) V(s) + \alpha (r + \gamma V(s'))$ 

Notice: we now have a way to update the state value after a single transition based on data.
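
The update rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `td0_update`, the dictionary-based value table, the `terminal` flag, and the state names `"home"` and `"highway"` are all assumptions made for the example.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """Apply one TD(0) update to the value table V in place and return V[s].

    V is a dict mapping states to value estimates (an assumed representation).
    """
    # A terminal state has value zero by definition, so we do not bootstrap from it.
    target = r if terminal else r + gamma * V[s_next]
    # Convex combination of the old estimate and the one-step target.
    V[s] = (1 - alpha) * V[s] + alpha * target
    return V[s]


# Hypothetical two-state example: the observed target is 1 + 0.9 * 10 = 10.
V = {"home": 0.0, "highway": 10.0}
new_value = td0_update(V, s="home", r=1.0, s_next="highway", alpha=0.5, gamma=0.9)
print(new_value)  # 5.0, halfway between the old estimate 0 and the target 10
```

With $\alpha = 0.5$ the estimate moves exactly halfway from the old value toward the target; a smaller $\alpha$ would move it less.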

#### TD Error

Let's rearrange the update equation:


$
\begin{aligned}  
V(s) &:= (1-\alpha) V(s) + \alpha (r + \gamma V(s'))  \\
&= V(s) + \alpha [r + \gamma V(s') -  V(s)]
\end{aligned}
$

This form makes the update clear, and it exposes the term $[r + \gamma V(s') -  V(s)]$, which we call the *TD error*: the gap between the one-step target $r + \gamma V(s')$ and the current estimate $V(s)$.  

We saw an update form like this earlier. A similar form appears in gradient descent.
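
To see the TD-error form at work over many transitions, here is a hedged sketch of tabular TD(0) prediction. The environment is an assumption chosen for illustration: a 5-state random walk in the spirit of the example in Sutton & Barto, Chapter 6, where the agent steps left or right with equal probability, exiting on the right pays a reward of 1, and every other reward is 0. With $\gamma = 1$ the true values of the five states are $1/6, 2/6, \dots, 5/6$. The function name and the hyperparameters are also illustrative choices.

```python
import random


def td0_random_walk(num_episodes=5000, alpha=0.02, gamma=1.0, seed=0):
    """Estimate state values of a 5-state random walk with tabular TD(0)."""
    rng = random.Random(seed)
    n_states = 5
    # Indices 0 and 6 are terminal states (value fixed at 0); 1..5 are non-terminal.
    V = [0.0] + [0.5] * n_states + [0.0]
    for _ in range(num_episodes):
        s = 3  # every episode starts in the middle state
        while s not in (0, n_states + 1):
            # Fixed policy: step left or right with equal probability.
            s_next = s + rng.choice((-1, 1))
            # Reward of 1 only when exiting on the right.
            r = 1.0 if s_next == n_states + 1 else 0.0
            td_error = r + gamma * V[s_next] - V[s]
            V[s] += alpha * td_error  # V(s) := V(s) + alpha * TD error
            s = s_next
    return V[1:n_states + 1]


estimates = td0_random_walk()
true_values = [i / 6 for i in range(1, 6)]
for est, truth in zip(estimates, true_values):
    print(f"estimate {est:.3f}   true {truth:.3f}")
```

Each update happens immediately after a transition, so the estimates change within an episode rather than only at its end. Because the step size is constant, the estimates keep fluctuating slightly around the true values instead of settling exactly.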

**Question 1**  
Think about the TD error and how it works. Does it make sense?

**Question 2**  
What happens when $\alpha=0$? What happens when $\alpha=1$?  
Is it a good idea to use these values?

---

#### Finding the Optimal Policy

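We typically won't have access to the environment dynamics. To improve the policy without a model, we need estimates of the action values $q(s,a)$.

As a reminder, $q_\pi(s,a)$ is the expected return from starting in state $s$, taking action $a$, and subsequently following policy $\pi$.

Why do we take action $a$ first? Comparing the values of different actions in the same state tells us whether deviating from $\pi$ would help, which is how we try to improve the policy.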



A popular on-policy method is *Sarsa*.

A popular off-policy method is *Q-learning*; it and its variants are among the most widely used RL methods.

We will discuss *Q-learning* going forward. Taking TD(0) update steps will be essential.
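
As a preview, the sketch below contrasts the one-step updates of the two methods using their standard textbook forms. Both apply a TD(0)-style update to action values; they differ only in which next-state action value they bootstrap from. The function names, the nested-dictionary Q-table, and the state and action labels are assumptions made for illustration, and terminal-state handling is omitted for brevity.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action a_next actually chosen in s_next."""
    td_error = r + gamma * Q[s_next][a_next] - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error


def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the highest-valued action in s_next."""
    td_error = r + gamma * max(Q[s_next].values()) - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error


def make_q():
    # Hypothetical Q-table with two states and two actions each.
    return {"s0": {"left": 0.0, "right": 0.0}, "s1": {"left": 1.0, "right": 2.0}}


Q_sarsa, Q_qlearn = make_q(), make_q()
# Same transition for both; Sarsa is told the next action taken was "left".
sarsa_update(Q_sarsa, "s0", "right", r=0.0, s_next="s1", a_next="left", alpha=0.5, gamma=1.0)
q_learning_update(Q_qlearn, "s0", "right", r=0.0, s_next="s1", alpha=0.5, gamma=1.0)
print(Q_sarsa["s0"]["right"], Q_qlearn["s0"]["right"])  # 0.5 1.0
```

On the same transition, Sarsa uses the value of the action that was actually selected next (1.0), while Q-learning uses the largest next-state action value (2.0), so the two updated estimates differ.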

---