# Chapter 4: Dynamic Programming

## 1. Policy Evaluation (Prediction)
- Given policy $\pi$, compute the state-value function $v_\pi$
- Iterative Policy Evaluation $\forall s\in\mathcal S$:
$$
\begin{aligned}
v_{k+1}(s) &= E_{\pi}[R_{t+1}+\gamma v_k(S_{t+1}) | S_t=s]
\\ &= \sum_a\pi(a | s)\sum_{s^\prime, r} p(s^\prime, r | s,a)\big[ r+\gamma v_k(s^\prime) \big]
\end{aligned}
$$
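- The initial approximation $v_0$ is chosen arbitrarily, except that every terminal state must be given value 0
- As $k\to\infty$, the sequence $\{v_k\}$ converges to $v_\pi$ (under the same conditions that guarantee $v_\pi$ exists: $\gamma<1$ or eventual termination from all states under $\pi$)
- Each update is an **expected update**: it is based on an expectation over all possible next states rather than on a single sampled next state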

![Iterative Policy Evaluation](assets/4.1.algo.png)
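
The update rule above can be sketched in a few lines of Python. This is only an illustrative sketch: the table layout `P[s][a] = [(prob, next_state, reward), ...]`, the function name `policy_evaluation`, and the three-state corridor MDP are assumptions made for the example, not part of the original notes. It uses the in-place variant, where each sweep overwrites values as it goes.

```python
def policy_evaluation(P, policy, gamma=1.0, theta=1e-10):
    """Iterative policy evaluation with in-place sweeps.

    P[s][a] is a list of (prob, next_state, reward) triples (assumed layout).
    policy[s][a] is pi(a | s).
    """
    V = [0.0] * len(P)  # v_0 = 0 everywhere, so terminal states start at 0
    while True:
        delta = 0.0
        for s in range(len(P)):
            # Expected update: average over actions, next states and rewards.
            v_new = sum(
                policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in range(len(P[s]))
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:  # stop once a full sweep changes nothing noticeably
            return V


# Hypothetical toy MDP: a corridor with states 0 and 1 and terminal state 2.
# Actions: 0 = left, 1 = right. Bumping into the left wall costs -2,
# every other move costs -1, and the terminal state is absorbing with reward 0.
P = [
    [[(1.0, 0, -2.0)], [(1.0, 1, -1.0)]],  # state 0
    [[(1.0, 0, -1.0)], [(1.0, 2, -1.0)]],  # state 1
    [[(1.0, 2, 0.0)], [(1.0, 2, 0.0)]],    # state 2 (terminal)
]
uniform = [[0.5, 0.5] for _ in P]  # equiprobable random policy
V = policy_evaluation(P, uniform)
print(V)
```

Solving the two linear Bellman equations for this toy problem by hand gives $v_\pi(0)=-8$ and $v_\pi(1)=-5$, which is what the sweeps approach. The terminal state is modeled as an absorbing state with zero reward, so its value stays at 0 without special-casing.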

## 2. Policy Improvement
- Policy improvement:
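    - Policy improvement theorem: for deterministic policies $\pi$ and $\pi^\prime$, if $q_\pi(s, \pi^\prime(s))\ge v_\pi(s)$ for all $s\in\mathcal S$, then $v_{\pi^\prime}(s)\ge v_\pi(s)$ for all $s\in\mathcal S$
    - where $\displaystyle q_\pi(s,a)=E\big[R_{t+1}+\gamma v_\pi(S_{t+1}) | S_t=s,A_t=a\big]=\sum_{s^\prime,r}p(s^\prime,r | s,a)[r+\gamma v_\pi(s^\prime)]$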
 
- The new **greedy policy** $\pi'$ takes the action that looks best in the short term, after one step of lookahead according to $v_\pi$:
$$
\begin{aligned}
\pi'(s) &= \arg\max\limits_a q_\pi(s,a)
\\ &= \arg\max\limits_a E\big[R_{t+1}+\gamma v_\pi(S_{t+1}) | S_t=s,A_t=a\big] 
\\ &= \arg\max\limits_a \sum_{s^\prime,r}p(s^\prime,r | s,a)[r+\gamma v_\pi(s^\prime)]
\end{aligned}
$$

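- If the greedy policy $\pi'$ is as good as, but not better than, $\pi$, i.e. $v_{\pi'}=v_\pi$, then $v_{\pi'}=v_*$, and both $\pi$ and $\pi'$ are optimal policies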
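
The one-step lookahead behind the greedy policy can be sketched as follows. The helper names `q_value` and `greedy_policy`, the table layout `P[s][a] = [(prob, next_state, reward), ...]`, and the toy corridor MDP are assumptions for illustration. The value function `V` is taken as given here; for this toy problem it is the value of the equiprobable random policy with $\gamma=1$.

```python
def q_value(P, V, s, a, gamma=1.0):
    """One-step lookahead: sum over (s', r) of p(s', r | s, a) * (r + gamma * V[s'])."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def greedy_policy(P, V, gamma=1.0):
    """Deterministic policy choosing argmax_a q(s, a); ties go to the lowest action index."""
    return [
        max(range(len(P[s])), key=lambda a: q_value(P, V, s, a, gamma))
        for s in range(len(P))
    ]


# Hypothetical toy MDP: states 0 and 1, terminal state 2; actions 0 = left, 1 = right.
P = [
    [[(1.0, 0, -2.0)], [(1.0, 1, -1.0)]],  # state 0: left bumps the wall (-2)
    [[(1.0, 0, -1.0)], [(1.0, 2, -1.0)]],  # state 1: right reaches the terminal state
    [[(1.0, 2, 0.0)], [(1.0, 2, 0.0)]],    # state 2: terminal, absorbing
]
V = [-8.0, -5.0, 0.0]  # assumed given: value of the uniform random policy, gamma = 1
pi_greedy = greedy_policy(P, V)
print(pi_greedy)  # action index per state
```

In state 0 the lookahead compares $-2+v(0)=-10$ for left with $-1+v(1)=-6$ for right, so the greedy policy moves right; the same holds in state 1. In the terminal state both actions are worth 0 and the tie is broken by index.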

## 3. Policy Iteration
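- Generalized Policy Iteration (GPI): the general idea of letting policy evaluation and policy improvement interact, independent of the granularity of the two processes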
![General Policy Iteration](assets/4.6.policy_iteration.png)

- Policy Iteration Algorithm:
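    - Starting from a policy $\pi$, use $v_\pi$ to yield a better policy $\pi'$
    - Then compute $v_{\pi'}$ and improve again to yield an even better policy $\pi''$
    - Repeat until the policy no longer changes; a finite MDP has only finitely many deterministic policies, so this reaches an optimal policy in a finite number of iterations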
    
![Policy Iteration](assets/4.2.policy_iteration.png)
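
The alternation of evaluation and improvement might look like this in Python. It is a sketch under assumptions: the names `evaluate` and `policy_iteration`, the table layout `P[s][a] = [(prob, next_state, reward), ...]`, the all-left starting policy, and the toy corridor MDP are chosen for illustration. A discount of $\gamma=0.9$ is used so that evaluation converges even for policies that never reach the terminal state.

```python
def q_value(P, V, s, a, gamma):
    """One-step lookahead value of taking action a in state s."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def evaluate(P, pi, gamma, theta=1e-10):
    """Iterative policy evaluation for a deterministic policy pi (in-place sweeps)."""
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            v_new = q_value(P, V, s, pi[s], gamma)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V


def policy_iteration(P, gamma=0.9):
    pi = [0] * len(P)  # arbitrary initial policy: always take action 0
    while True:
        V = evaluate(P, pi, gamma)                       # policy evaluation
        new_pi = [
            max(range(len(P[s])), key=lambda a: q_value(P, V, s, a, gamma))
            for s in range(len(P))
        ]                                                # policy improvement
        if new_pi == pi:                                 # policy is stable
            return pi, V
        pi = new_pi


# Hypothetical toy MDP: states 0 and 1, terminal state 2; actions 0 = left, 1 = right.
P = [
    [[(1.0, 0, -2.0)], [(1.0, 1, -1.0)]],  # state 0: left bumps the wall (-2)
    [[(1.0, 0, -1.0)], [(1.0, 2, -1.0)]],  # state 1: right reaches the terminal state
    [[(1.0, 2, 0.0)], [(1.0, 2, 0.0)]],    # state 2: terminal, absorbing
]
pi_star, V_star = policy_iteration(P)
print(pi_star, V_star)
```

On this toy problem the all-left policy is replaced by always-right after one improvement step, and the second improvement step leaves it unchanged. The stability test compares action indices, which is safe here because no non-terminal state has tied actions; with exact ties, a practical implementation may prefer to keep the old action unless another one is strictly better.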


## 4. Value Iteration
- Value Iteration, $\forall s\in\mathcal S$:
$$v_{k+1}(s)=\max_a\sum_{s',r}p(s',r | s,a)\big[r+\gamma v_k(s')\big]$$
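    - computes the optimal value function $v_*$ directly, without maintaining an explicit policy; an optimal policy is read off greedily at the end
    - a special case of truncated policy iteration
    - policy evaluation is stopped after just one sweep (one update of each state)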

![Value Iteration](assets/4.3.value_iteration.png)
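
A compact Python sketch of the update with the max over actions is shown below. The function name `value_iteration`, the table layout `P[s][a] = [(prob, next_state, reward), ...]`, the choice $\gamma=0.9$, and the toy corridor MDP are assumptions for illustration.

```python
def value_iteration(P, gamma=0.9, theta=1e-10):
    """In-place value iteration; returns the value estimate and a greedy policy."""

    def q(V, s, a):
        # One-step lookahead value of action a in state s.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            v_new = max(q(V, s, a) for a in range(len(P[s])))  # backup with a max
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Read off a deterministic policy that is greedy with respect to the final V.
    pi = [max(range(len(P[s])), key=lambda a: q(V, s, a)) for s in range(len(P))]
    return V, pi


# Hypothetical toy MDP: states 0 and 1, terminal state 2; actions 0 = left, 1 = right.
P = [
    [[(1.0, 0, -2.0)], [(1.0, 1, -1.0)]],  # state 0: left bumps the wall (-2)
    [[(1.0, 0, -1.0)], [(1.0, 2, -1.0)]],  # state 1: right reaches the terminal state
    [[(1.0, 2, 0.0)], [(1.0, 2, 0.0)]],    # state 2: terminal, absorbing
]
V_opt, pi_opt = value_iteration(P)
print(V_opt, pi_opt)
```

For this toy problem the optimal values are $v_*(1)=-1$ and $v_*(0)=-1+0.9\cdot(-1)=-1.9$, reached after two sweeps; a third sweep only confirms that nothing changes.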

## 5. Summary
- Policy evaluation: backups without a max
- Policy improvement: form a greedy policy, if only locally
- Policy iteration: alternate the above two processes
- Value iteration: backups with a max
- Full backups (to be contrasted later with sample backups)
- Generalized Policy Iteration (GPI)
- Asynchronous DP: avoids exhaustive sweeps by updating states in place in any order, for example one randomly selected state per step
- Bootstrapping: update estimates based on other estimates
- The biggest limitation of DP is that it requires a *probability model* (as opposed to a generative or simulation model)
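
As a rough illustration of asynchronous DP, the sketch below applies the value-iteration update to one randomly chosen state at a time instead of sweeping the whole state set. The function name `async_value_iteration`, the fixed budget of 200 updates (used in place of a convergence test), the seed, and the toy corridor MDP are all assumptions for the example.

```python
import random


def async_value_iteration(P, gamma=0.9, n_updates=200, seed=0):
    """Asynchronous, in-place value iteration: one randomly selected state per step.

    P[s][a] is a list of (prob, next_state, reward) triples (assumed layout).
    """
    rng = random.Random(seed)  # seeded so the run is reproducible
    V = [0.0] * len(P)
    for _ in range(n_updates):
        s = rng.randrange(len(P))  # no systematic sweep, just pick a state
        V[s] = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in range(len(P[s]))
        )
    return V


# Hypothetical toy MDP: states 0 and 1, terminal state 2; actions 0 = left, 1 = right.
P = [
    [[(1.0, 0, -2.0)], [(1.0, 1, -1.0)]],  # state 0: left bumps the wall (-2)
    [[(1.0, 0, -1.0)], [(1.0, 2, -1.0)]],  # state 1: right reaches the terminal state
    [[(1.0, 2, 0.0)], [(1.0, 2, 0.0)]],    # state 2: terminal, absorbing
]
V_async = async_value_iteration(P)
print(V_async)
```

Each update bootstraps from whatever estimates are currently stored for the successor states. Convergence relies on every state continuing to be selected; a fixed update budget is a simplification that happens to be ample for a three-state problem.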