# Chapter 7: n-step Bootstrapping

## 1. n-step TD Prediction
- Generalizes the one-step TD(0) method
- The temporal difference extends over $n$ steps instead of one

![n-step methods](assets/7.1.n-step.png)

- Goal: update the estimate of $v_\pi(S_t)$ for state $S_t$ from the trajectory:
$$S_t,R_{t+1},S_{t+1},R_{t+2},...,R_T,S_T$$

    - for *MC*, target is complete return
    $$G_t = R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+...+\gamma^{T-t-1}R_T$$
    - for *one-step TD*, target is the one-step return
    $$G_{t:t+1} = R_{t+1}+\gamma V_t(S_{t+1})$$
    - for *two-step TD*, target is the two-step return
    $$G_{t:t+2} = R_{t+1}+\gamma R_{t+2}+\gamma^2V_{t+1}(S_{t+2})$$
    - for *n-step TD*, target is the n-step return with $n\ge 1, 0\le t<T-n$
    $$
    \begin{cases}
    G_{t:t+n} &= R_{t+1}+\gamma R_{t+2}+...+\gamma^{n-1}R_{t+n}+\gamma^nV_{t+n-1}(S_{t+n})
    \\G_{t:t+n} &= G_t ~~~,\text{if } t+n\ge T
    \end{cases}
    $$
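
The n-step return above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name and the list layout (`rewards[k]` holding $R_{k+1}$, `values[k]` holding the current estimate of $V(S_k)$) are assumptions made for the example, and a single value table stands in for the time-indexed $V_{t+n-1}$.

```python
def n_step_return(rewards, values, t, n, gamma):
    """Return G_{t:t+n} for one recorded episode.

    Assumed layout (hypothetical, for illustration only):
    rewards[k] is R_{k+1}, values[k] is the estimate V(S_k),
    and the episode length is T = len(rewards).
    """
    T = len(rewards)
    h = min(t + n, T)  # truncate the reward sum at the end of the episode
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, h))
    if t + n < T:  # bootstrap only if S_{t+n} is non-terminal
        g += gamma ** n * values[t + n]
    return g
```

With `n = 1` this is the one-step TD target, and once `t + n >= T` it falls back to the full Monte Carlo return $G_t$.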


- $G_{t:t+n}$ depends on $R_{t+n}$ and $V_{t+n-1}$, which are only available at time $t+n$, so the update is made then:
$$V_{t+n}(S_t) = V_{t+n-1}(S_t)+\alpha\big[G_{t:t+n}-V_{t+n-1}(S_t)\big] ~~~, 0\le t<T$$
    - all other states remain unchanged: $V_{t+n}(s)=V_{t+n-1}(s), \forall s\neq S_t$
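
As a rough sketch of the update rule, the function below replays one recorded episode and applies the n-step TD update to each visited state in time order. Updating the states in order reproduces the online schedule, because the update for time $\tau$ only sees changes made for earlier times. The function name, the dict-based value table, and the episode layout are assumptions for this example.

```python
def n_step_td_episode(V, states, rewards, n, alpha, gamma):
    """Apply tabular n-step TD updates in place for one recorded episode.

    Assumed layout (hypothetical): states = [S_0, ..., S_{T-1}] are the
    non-terminal states visited, rewards = [R_1, ..., R_T], and V maps
    state -> estimate. The terminal state has value 0 and is not stored.
    """
    T = len(rewards)
    for tau in range(T):  # tau is the time whose state estimate is updated
        h = min(tau + n, T)
        g = sum(gamma ** (k - tau) * rewards[k] for k in range(tau, h))
        if tau + n < T:  # bootstrap from the current estimate of S_{tau+n}
            g += gamma ** n * V[states[tau + n]]
        V[states[tau]] += alpha * (g - V[states[tau]])
    return V
```

For example, on the two-state episode `states = ["A", "B"]`, `rewards = [0.0, 1.0]` with all estimates starting at zero, `n = 2` moves both estimates toward the final reward, while `n = 1` only changes the estimate for `"B"`.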

![n-step methods](assets/7.1.n-step-td.png)


- **Error Reduction Property** of n-step returns:
$$\max_s\big| E_\pi[G_{t:t+n} | S_t=s]-v_\pi(s)\big| \le \gamma^n\max_s\big| V_{t+n-1}(s)-v_\pi(s)\big|, \forall n\ge 1$$
- Because of this property, n-step TD methods can be shown formally to converge to the correct predictions under appropriate technical conditions

## 2. n-step Sarsa
- Replace states with state-action pairs and act with an ε-greedy policy

![n-step SARSA](assets/7.2.n-step-sarsa.png)

- n-step returns for action-value:
$$G_{t:t+n}=R_{t+1}+\gamma R_{t+2}+...+\gamma^{n-1}R_{t+n}+\gamma^nQ_{t+n-1}(S_{t+n},A_{t+n})~~~, n\ge 1, 0\le t<T-n$$
    with $G_{t:t+n}=G_t \text{ if }t+n\ge T$


- **n-step Sarsa**:
    $$Q_{t+n}(S_t,A_t)=Q_{t+n-1}(S_t,A_t)+\alpha\big[G_{t:t+n}-Q_{t+n-1}(S_t,A_t)\big]~~~,0\le t<T$$
 
 
- **n-step Expected Sarsa**:
$$G_{t:t+n}=R_{t+1}+\gamma R_{t+2}+...+\gamma^{n-1}R_{t+n}+\gamma^n\overline V_{t+n-1}(S_{t+n})~~~, t+n<T$$
    - where, *expected approximate value* of state $s$:
    $$\overline V_t(s)=\sum_a\pi(a | s)Q_t(s,a) ~~~, \forall s\in\mathcal S$$
    - if $s$ is terminal, then $\overline V(s)=0$
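
The two action-value targets differ only in the bootstrap term, which the sketch below tries to make explicit. All names are hypothetical: `Q` and `pi` are assumed to be nested dicts indexed as `Q[s][a]` and `pi[s][a]`, `rewards[k]` holds $R_{k+1}$, and one table stands in for the time-indexed $Q_{t+n-1}$.

```python
def expected_value(Q, pi, s):
    """V-bar(s) = sum_a pi(a|s) * Q(s, a) for a non-terminal state s."""
    return sum(pi[s][a] * Q[s][a] for a in Q[s])


def n_step_q_return(Q, pi, states, actions, rewards, t, n, gamma, expected=False):
    """G_{t:t+n} for n-step Sarsa, or n-step Expected Sarsa if expected=True."""
    T = len(rewards)
    h = min(t + n, T)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, h))
    if t + n < T:  # a terminal S_{t+n} contributes 0, so no bootstrap term
        s, a = states[t + n], actions[t + n]
        tail = expected_value(Q, pi, s) if expected else Q[s][a]
        g += gamma ** n * tail
    return g
```

Sarsa bootstraps from the sampled pair $(S_{t+n},A_{t+n})$, whereas Expected Sarsa averages over all actions in $S_{t+n}$ under $\pi$; both reduce to $G_t$ when $t+n\ge T$.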
    
![n-step SARSA Pseudocode](assets/7.2.n-step-sarsa-pseudocode.png)

## 3. n-step Off-policy Learning
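- Weight each update by the importance sampling ratio, the relative probability under the target policy $\pi$ and behavior policy $b$ of taking just the $n$ actions: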
$$\rho_{t:h}=\prod_{k=t}^{\min(h,T-1)}\frac{\pi(A_k | S_k)}{b(A_k | S_k)}$$
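
A direct transcription of this product might look as follows. The nested-dict policies `pi[s][a]` and `b[s][a]` and the function name are assumptions made for the example.

```python
def importance_ratio(pi, b, states, actions, t, h):
    """rho_{t:h} = prod_{k=t}^{min(h, T-1)} pi(A_k|S_k) / b(A_k|S_k).

    Assumed layout (hypothetical): states[k], actions[k] are S_k, A_k
    for k = 0, ..., T-1, so T = len(states).
    """
    T = len(states)
    rho = 1.0  # an empty product (h < t) leaves the ratio at 1
    for k in range(t, min(h, T - 1) + 1):
        s, a = states[k], actions[k]
        rho *= pi[s][a] / b[s][a]
    return rho
```

If any action along the way has zero probability under $\pi$, the ratio is zero and the corresponding return is ignored entirely.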

- n-step TD:
$$V_{t+n}(S_t)=V_{t+n-1}(S_t)+\alpha\color{blue}{\rho_{t:t+n-1}}\big[G_{t:t+n}-V_{t+n-1}(S_t)\big]~~~,0\le t<T$$

- n-step Sarsa:
$$Q_{t+n}(S_t,A_t)=Q_{t+n-1}(S_t,A_t)+\alpha\color{blue}{\rho_{t+1:t+n}}\big[G_{t:t+n}-Q_{t+n-1}(S_t,A_t)\big]~~~,0\le t<T$$

- n-step Expected Sarsa:
$$Q_{t+n}(S_t,A_t)=Q_{t+n-1}(S_t,A_t)+\alpha\color{blue}{\rho_{t+1:t+n-1}}\big[G_{t:t+n}-Q_{t+n-1}(S_t,A_t)\big]~~~,0\le t<T$$
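
To illustrate how the ratio enters the update, here is a small sketch of one off-policy n-step Sarsa update. It is self-contained and uses hypothetical names and nested-dict tables (`Q[s][a]`, `pi[s][a]`, `b[s][a]`); note that the ratio starts at $t+1$, since $A_t$ has already been taken and is the action being evaluated.

```python
def off_policy_sarsa_update(Q, pi, b, states, actions, rewards, t, n, alpha, gamma):
    """Update Q[S_t][A_t] in place and return (rho, G) for inspection.

    Assumed layout (hypothetical): states[k], actions[k] are S_k, A_k and
    rewards[k] is R_{k+1}, for k = 0, ..., T-1.
    """
    T = len(rewards)
    # rho_{t+1:t+n}: the first action A_t is excluded from the product
    rho = 1.0
    for k in range(t + 1, min(t + n, T - 1) + 1):
        s, a = states[k], actions[k]
        rho *= pi[s][a] / b[s][a]
    # n-step Sarsa return G_{t:t+n}
    h = min(t + n, T)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, h))
    if t + n < T:
        g += gamma ** n * Q[states[t + n]][actions[t + n]]
    s, a = states[t], actions[t]
    Q[s][a] += alpha * rho * (g - Q[s][a])
    return rho, g
```

For the Expected Sarsa variant, the product would stop one step earlier (at $t+n-1$) and the bootstrap term would use $\overline V(S_{t+n})$ instead of $Q(S_{t+n},A_{t+n})$.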

![n-step Off-policy](assets/7.3.n-step-off-policy.png)

## 4. Per-decision Methods with Control Variates
- Add a *control variate* to the **off-policy** n-step return to reduce variance:
$$G_{t:h}=\rho_t(R_{t+1}+\gamma G_{t+1:h})+(1-\rho_t)V_{h-1}(S_t) ~~~,t<h<T$$
    where, $G_{h:h}=V_{h-1}(S_h)$

- if $\rho_t=0$, the target equals the current estimate $V_{h-1}(S_t)$, so the estimate does not change (instead of shrinking toward zero)
- Includes on-policy when $\rho_t=1$
- for action values, the first action does not play a role in the importance sampling
$$
\begin{aligned}
G_{t:h} &= R_{t+1}+\gamma\big(\rho_{t+1}G_{t+1:h}+\overline V_{h-1}(S_{t+1})-\rho_{t+1}Q_{h-1}(S_{t+1},A_{t+1})\big)
\\ &= R_{t+1}+\gamma\rho_{t+1}\big(G_{t+1:h}-Q_{h-1}(S_{t+1},A_{t+1})\big)+\gamma\overline V_{h-1}(S_{t+1})
\end{aligned}
$$
    where, $t<h\le T$, if $h<T$, then $G_{h:h}=Q_{h-1}(S_h,A_h)$, else $G_{T-1:h}=R_T$
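
The state-value recursion with a control variate is short enough to write out directly. The sketch below is illustrative only: the names are hypothetical, `rhos[k]` is assumed to hold the single-step ratio $\rho_k$, and a single table `V` stands in for $V_{h-1}$.

```python
def cv_return(V, rhos, states, rewards, t, h, gamma):
    """G_{t:h} = rho_t (R_{t+1} + gamma G_{t+1:h}) + (1 - rho_t) V(S_t).

    The recursion ends with G_{h:h} = V(S_h); the caller is assumed to
    pass t <= h < T so that S_h is a stored non-terminal state.
    """
    if t == h:
        return V[states[h]]
    tail = cv_return(V, rhos, states, rewards, t + 1, h, gamma)
    return rhos[t] * (rewards[t] + gamma * tail) + (1 - rhos[t]) * V[states[t]]
```

With every ratio equal to 1 this reproduces the ordinary n-step return, and with $\rho_t=0$ it returns $V(S_t)$, which leaves the estimate unchanged after the update.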

## 5. Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm
- Use the **leaf nodes** of the tree to estimate action-values


![example](assets/7.5.off-policy-wto-weight.png)


- one-step return is the same as Expected Sarsa for $t<T-1$:
$$G_{t:t+1}=R_{t+1}+\gamma\sum_a\pi(a | S_{t+1})Q_t(S_{t+1},a)$$
- two-step tree-backup for $t<T-2$:
$$
\begin{aligned}
G_{t:t+2} &= R_{t+1}+\gamma\sum_{a\neq A_{t+1}}\pi(a | S_{t+1})Q_{t+1}(S_{t+1},a)
\\ & ~~~ +\gamma\pi(A_{t+1} | S_{t+1})\big(R_{t+2}+\gamma\sum_{a}\pi(a | S_{t+2})Q_{t+1}(S_{t+2},a)\big)
\\ &= R_{t+1}+\gamma\sum_{a\neq A_{t+1}}\pi(a | S_{t+1})Q_{t+1}(S_{t+1},a)+\gamma\pi(A_{t+1} | S_{t+1})G_{t+1:t+2}
\end{aligned}
$$

- n-step tree-backup for $t<T-1,n\ge 2$ (with $G_{T-1:t+n}=R_T$):
$$G_{t:t+n} = R_{t+1}+\gamma\sum_{a\neq A_{t+1}}\pi(a | S_{t+1})Q_{t+n-1}(S_{t+1},a)+\gamma\pi(A_{t+1} | S_{t+1})G_{t+1:t+n}$$

- action-value update rule as usual from n-step Sarsa:
$$Q_{t+n}(S_t,A_t)=Q_{t+n-1}(S_t,A_t)+\alpha[G_{t:t+n}-Q_{t+n-1}(S_t,A_t)]$$
    for, $0\le t < T$
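
The recursive structure of the tree-backup return can be sketched as below. This is a simplified illustration under stated assumptions: names are hypothetical, `Q[s][a]` and `pi[s][a]` are nested dicts, and one table stands in for the time-indexed estimates. The horizon is `h = t + n`.

```python
def tree_backup_return(Q, pi, states, actions, rewards, t, h, gamma):
    """G_{t:h} for the n-step tree-backup algorithm, with h = t + n.

    Assumed layout (hypothetical): states[k], actions[k] are S_k, A_k and
    rewards[k] is R_{k+1}, for k = 0, ..., T-1.
    """
    T = len(rewards)
    if t + 1 >= T:  # t = T-1: the next state is terminal, so G = R_T
        return rewards[t]
    s1, a1 = states[t + 1], actions[t + 1]
    if t + 1 == h:  # one-step case: the Expected Sarsa target
        return rewards[t] + gamma * sum(pi[s1][a] * Q[s1][a] for a in Q[s1])
    # Leaf nodes: actions not taken at S_{t+1}, weighted by pi
    off_path = sum(pi[s1][a] * Q[s1][a] for a in Q[s1] if a != a1)
    tail = tree_backup_return(Q, pi, states, actions, rewards, t + 1, h, gamma)
    return rewards[t] + gamma * off_path + gamma * pi[s1][a1] * tail
```

No behavior-policy probabilities appear anywhere in the function, which is the sense in which the method avoids importance sampling.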


![n-step Tree Backup](assets/7.5.n-step-tree-backup.png)

## 6. A Unifying Algorithm: n-step Q(σ)

![n-step-types](assets/7.6.n-step-types.png)

- $\sigma_t\in[0,1]$ denotes the degree of sampling on step $t$
    - $\sigma=1$ for full sampling (Sarsa)
    - $\sigma=0$ for pure expectation (tree backup)

- Rewrite the n-step tree-backup return with horizon $h=t+n$ as:
$$
\begin{aligned}
G_{t:h} &= R_{t+1}+\gamma\sum_{a\neq A_{t+1}}\pi(a | S_{t+1})Q_{h-1}(S_{t+1},a)+\gamma\pi(A_{t+1} | S_{t+1})G_{t+1:h}
\\ &= R_{t+1}+\gamma\overline V_{h-1}(S_{t+1})-\gamma\pi(A_{t+1} | S_{t+1})Q_{h-1}(S_{t+1},A_{t+1})+\gamma\pi(A_{t+1} | S_{t+1})G_{t+1:h}
\\ &= R_{t+1}+\gamma\pi(A_{t+1} | S_{t+1})\big(G_{t+1:h}-Q_{h-1}(S_{t+1},A_{t+1})\big)+\gamma\overline V_{h-1}(S_{t+1})
\end{aligned}
$$

- n-step $Q(\sigma)$:
$$G_{t:h}=R_{t+1}+\gamma\big(\sigma_{t+1}\rho_{t+1}+(1-\sigma_{t+1})\pi(A_{t+1} | S_{t+1})\big)\big(G_{t+1:h}-Q_{h-1}(S_{t+1},A_{t+1})\big)+\gamma\overline V_{h-1}(S_{t+1})$$
    
    where, $t<h\le T$
    - if $h<T$, then $G_{h:h}=Q_{h-1}(S_h,A_h)$
    - if $h=T$, then $G_{T-1:T}=R_T$
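
A compact sketch of the $Q(\sigma)$ return follows, mainly to show how $\sigma$ blends the two extremes. Names and data layout are assumptions for the example: `Q[s][a]`, `pi[s][a]`, `b[s][a]` are nested dicts, `sigmas[k]` holds $\sigma_k$, and one table stands in for $Q_{h-1}$.

```python
def q_sigma_return(Q, pi, b, sigmas, states, actions, rewards, t, h, gamma):
    """G_{t:h} for n-step Q(sigma); the caller is assumed to pass t <= h <= T.

    Assumed layout (hypothetical): states[k], actions[k] are S_k, A_k and
    rewards[k] is R_{k+1}, for k = 0, ..., T-1.
    """
    T = len(rewards)
    if t == h:  # only reachable when h < T: bootstrap from Q(S_h, A_h)
        return Q[states[h]][actions[h]]
    if t == T - 1:  # h = T: the recursion ends with G_{T-1:T} = R_T
        return rewards[t]
    s1, a1 = states[t + 1], actions[t + 1]
    v_bar = sum(pi[s1][a] * Q[s1][a] for a in Q[s1])
    rho = pi[s1][a1] / b[s1][a1]
    # sigma = 1 weights by rho (sampling); sigma = 0 weights by pi (expectation)
    weight = sigmas[t + 1] * rho + (1 - sigmas[t + 1]) * pi[s1][a1]
    tail = q_sigma_return(Q, pi, b, sigmas, states, actions, rewards, t + 1, h, gamma)
    return rewards[t] + gamma * weight * (tail - Q[s1][a1]) + gamma * v_bar
```

Setting every $\sigma_k=0$ gives the tree-backup return, while $\sigma_k=1$ with $b=\pi$ gives the ordinary n-step Sarsa return.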


![n-step-q](assets/7.6.n-step-q.png)