# Chapter 3: Dynamic Programming

## 1. Exercise 4.1
$\pi$ is the equiprobable random policy, so all actions are equally likely.

- $q_\pi(11, down)$

With current state $s=11$ and action $a=down$, the next state is the terminal state $s'=end$, whose value is $v_\pi(end)=0$; the transition itself yields reward $r=-1$.

$$
\begin{aligned}
q_\pi(11, down) &= \sum_{s',r}p(s',r | s,a)\big[r+\gamma v_\pi(s')\big]
\cr &= 1 * (-1 + 0)
\cr &= -1
\end{aligned}
$$

- $q_\pi(7, down)$

With current state $s=7$ and action $a=down$, the next state is $s'=11$, which is not terminal and has state value $v_\pi(11)$.

$$
\begin{aligned}
q_\pi(7, down) &= \sum_{s',r}p(s',r | s,a)\big[r+\gamma v_\pi(s')\big]
\cr &= 1 * \big[-1 + \gamma v_\pi(11)\big]
\cr &= -1 + \gamma v_\pi(11)
\end{aligned}
$$

With $\gamma=1$ and $v_\pi(11)=-14$ from Figure 4.1, this gives $q_\pi(7, down)=-15$.
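Both action values can be checked numerically. The following is a minimal sketch, assuming the 4x4 gridworld of Example 4.1 with $\gamma=1$, reward $-1$ per step, and cells numbered row by row so that cells 0 and 15 are the terminal corners; the helper names `step`, `evaluate_random_policy`, and `q_value` are illustrative only.

```python
# Iterative policy evaluation for the 4x4 gridworld (Example 4.1),
# assuming gamma = 1, reward -1 per step, equiprobable random policy.
GAMMA = 1.0
TERMINAL = {0, 15}  # the two shaded corner cells (one terminal state)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(cell, action):
    """Return the next cell; moves off the grid leave the cell unchanged."""
    row, col = divmod(cell, 4)
    d_row, d_col = ACTIONS[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < 4 and 0 <= new_col < 4:
        return new_row * 4 + new_col
    return cell


def evaluate_random_policy(theta=1e-10):
    """Sweep in place until the largest change in a sweep is below theta."""
    v = [0.0] * 16  # terminal cells keep value 0
    while True:
        delta = 0.0
        for s in range(16):
            if s in TERMINAL:
                continue
            new_v = sum(0.25 * (-1 + GAMMA * v[step(s, a)]) for a in ACTIONS)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v


def q_value(v, cell, action):
    """q_pi(s, a) = -1 + gamma * v_pi(s') for a deterministic transition."""
    return -1 + GAMMA * v[step(cell, action)]


v = evaluate_random_policy()
print(round(q_value(v, 11, "down"), 3))                  # -1.0
print(round(v[11], 3), round(q_value(v, 7, "down"), 3))  # -14.0 -15.0
```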

## 2. Exercise 4.2
- Transitions from the original states are unchanged
$$
\begin{aligned}
v_\pi(15) &= \sum_a \pi(a|s=15)\sum_{s',r}p(s',r|s,a)\big[r+\gamma v_\pi(s')\big]
\cr &= 0.25\big[1*\big(-1+\gamma v_\pi(12)\big)+1*\big(-1+\gamma v_\pi(13)\big)+1*\big(-1+\gamma v_\pi(14)\big)+1*\big(-1+\gamma v_\pi(15)\big)\big]
\cr &= -1 + 0.25\gamma\sum_{s=12}^{15}v_\pi(s)
\end{aligned}
$$

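where state 13 keeps its original dynamics (**down** leaves it in place): $\displaystyle v_\pi(13)=-1 + 0.25\gamma\sum_{s\in\{9,12,13,14\}}v_\pi(s)$

With $\gamma=1$ and the values $v_\pi(12)=-22$, $v_\pi(13)=-20$, $v_\pi(14)=-14$ from Figure 4.1, solving for $v_\pi(15)$ gives $v_\pi(15)=-20$.

- Action **down** from state 13 is changed to lead to state 15

The formula for $v_\pi(15)$ is the same as above:
$$v_\pi(15)=-1 + 0.25\gamma\sum_{s=12}^{15}v_\pi(s)$$

but now $\displaystyle v_\pi(13)=-1 + 0.25\gamma\sum_{s\in\{9,12,14,15\}}v_\pi(s)$

Substituting $v_\pi(15)=-20$ (with $\gamma=1$) gives $v_\pi(13)=-1+0.25(-20-22-14-20)=-20$, the same as before. The original values therefore still satisfy the Bellman equations, and $v_\pi(15)=-20$ again.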
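Both variants can be evaluated numerically. This is a minimal sketch, assuming $\gamma=1$, reward $-1$ per step, and the equiprobable random policy; the terminal state is represented by the label `"T"`, and the function names are illustrative only.

```python
# Gridworld of Example 4.1 plus an extra state 15 below state 13.
# Assumptions: gamma = 1, reward -1 per step, equiprobable random policy.
GAMMA = 1.0
TERMINAL = "T"
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def grid_step(state, action):
    """Original 4x4 dynamics for states 1..14; both corners are terminal."""
    row, col = divmod(state, 4)
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    if not (0 <= new_row < 4 and 0 <= new_col < 4):
        return state  # bumping into the edge leaves the state unchanged
    cell = new_row * 4 + new_col
    # Corner cells 0 and 15 are the terminal state, so the integer 15 is
    # free to denote the newly added state below.
    return TERMINAL if cell in (0, 15) else cell


def build_successors(down_from_13_goes_to_15):
    """Map each nonterminal state to its four successor states."""
    succ = {s: [grid_step(s, a) for a in MOVES] for s in range(1, 15)}
    # New state 15: up, down, left, right lead to 13, 15, 12, 14.
    succ[15] = [13, 15, 12, 14]
    if down_from_13_goes_to_15:
        succ[13] = [15 if a == "down" else grid_step(13, a) for a in MOVES]
    return succ


def evaluate(succ, theta=1e-10):
    """In-place iterative policy evaluation for the random policy."""
    v = {s: 0.0 for s in succ}
    v[TERMINAL] = 0.0
    while True:
        delta = 0.0
        for s, next_states in succ.items():
            new_v = sum(0.25 * (-1 + GAMMA * v[n]) for n in next_states)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v


v_unchanged = evaluate(build_successors(False))
v_modified = evaluate(build_successors(True))
print(round(v_unchanged[15], 3), round(v_modified[15], 3))  # -20.0 -20.0
```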

## 3. Exercise 4.3
- $q_\pi$ evaluation
$$
\begin{aligned}
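q_\pi(s, a) &= E_\pi[G_t | S_t=s, A_t=a]
\cr &= E_\pi[R_{t+1}+\gamma G_{t+1} | S_t=s, A_t=a]
\cr &= E_\pi[R_{t+1}+\gamma v_\pi(S_{t+1}) | S_t=s, A_t=a]
\cr &= \sum_{s',r}p(s',r | s,a)\big[r+\gamma v_\pi(s')\big]
\cr &= \sum_{s',r}p(s',r | s,a)\Big[r+\gamma \sum_{a'\in\mathcal A(s')}\pi(a' | s')q_\pi(s', a')\Big]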
\end{aligned}
$$

- Update rule for $q_\pi$
$$
\begin{aligned}
q_{k+1}(s, a) &= E_\pi[R_{t+1} + \gamma v_k(S_{t+1}) | S_t=s, A_t=a]
\cr &= \sum_{s',r}p(s',r | s,a)\big[r+\gamma v_k(s')\big]
\cr &= \sum_{s',r}p(s',r | s,a)\Big[r+\gamma \sum_{a'\in\mathcal A(s')}\pi(a' | s')q_k(s', a')\Big]
\end{aligned}
$$

## 4. Exercise 4.4
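The policy evaluation loop stops as soon as the largest change in a sweep falls below $\theta$:

$$\Delta = \max\big(\Delta, | v-V(s) |\big)$$

so evaluation can stop slightly before convergence, and two equally good policies may appear to differ by a small amount. A stricter test is to take the sum of all differences:

$$\Delta = \Delta + | v-V(s) |$$

This alone does not rule out switching, because the switch itself happens in policy improvement, when $\arg\max_a$ breaks a tie differently from one iteration to the next. To guarantee termination, keep $\pi(s)$ unchanged whenever the old action still attains the maximum, and set $\textit{policy-stable}\gets\textit{false}$ only when the new action is strictly better.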
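One way to guarantee termination, sketched here as an assumption rather than as the book's pseudocode, is to make the improvement step tie-aware: the old action is kept whenever it still attains the maximum (up to a tolerance), so switching between equally good actions never marks the policy as unstable. The function `improve_policy` and its dictionary-based inputs are hypothetical.

```python
def improve_policy(policy, action_values, tol=1e-9):
    """Greedy improvement that keeps the old action when it is still optimal.

    policy: dict mapping state -> action
    action_values: dict mapping state -> {action: one-step lookahead value}
    Returns (new_policy, policy_stable).
    """
    new_policy = {}
    policy_stable = True
    for state, values in action_values.items():
        best = max(values.values())
        old_action = policy[state]
        if values[old_action] >= best - tol:
            # The old action ties for the maximum: keep it, so ties never flip.
            new_policy[state] = old_action
        else:
            # A strictly better action exists: switch and report the change.
            new_policy[state] = max(values, key=values.get)
            policy_stable = False
    return new_policy, policy_stable


# Two equally good actions in state "s": the policy is reported as stable.
policy = {"s": "right"}
action_values = {"s": {"left": -3.0, "right": -3.0, "up": -5.0}}
new_policy, stable = improve_policy(policy, action_values)
print(new_policy, stable)  # {'s': 'right'} True
```

The tolerance `tol` guards against ties that differ only by floating-point error or by the residual left when policy evaluation stops at $\Delta<\theta$.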

## 5. Exercise 4.5

Policy Iteration algorithm for action values

### 1. Initialization
$\quad \pi(s)\in\mathcal A(s)$ and $Q(s,a)\in\mathbb R$ arbitrarily for all $s\in\mathcal S$ and $a\in\mathcal A(s)$

### 2. Policy Evaluation
$\quad$Loop:
  
$\quad\quad \Delta\gets0$

$\quad\quad$ Loop for each $s\in\mathcal S$

$\quad\quad\quad$ Loop for each $a\in\mathcal A(s)$

$\quad\quad\quad\quad q\gets Q(s,a)$

$\quad\quad\quad\quad \displaystyle Q(s,a)\gets \sum_{s',r}p(s',r | s,a)\Big[r+\gamma \sum_{a'\in\mathcal A(s')}\pi(a' | s')Q(s', a')\Big]$

$\quad\quad\quad\quad \Delta\gets \Delta+\big| q- Q(s,a)\big|$

$\quad \text{until }\Delta<\theta$ (a small positive number determining the accuracy of estimation)


### 3. Policy Improvement
$\quad\textit{policy-stable}\gets\textit{true}$

$\quad$For each $s\in\mathcal S$

$\quad\quad \textit{old-action}\gets\pi(s)$

$\quad\quad \pi(s)\gets\arg\max_a Q(s,a)$

$\quad\quad$If $\textit{old-action}\neq\pi(s)$, then $\textit{policy-stable}\gets\textit{false}$

$\quad$If $\textit{policy-stable}$, then stop and return $Q\approx q_*$ and $\pi\approx\pi_*$; else go to $2$
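The three steps can be sketched in Python on a tiny example. Everything below is an assumption made for illustration: the two-state MDP `model`, the discount $\gamma=0.9$, and the function names. Because the policy is deterministic, the inner sum over $\pi(a'|s')$ collapses to $Q(s',\pi(s'))$.

```python
GAMMA = 0.9
THETA = 1e-10
# Hypothetical MDP: model[s][a] is a list of (probability, next_state, reward).
# "T" is terminal; it does not appear as a key and has value 0.
model = {
    "A": {"left": [(1.0, "T", 1.0)], "right": [(1.0, "B", 0.0)]},
    "B": {"left": [(1.0, "A", 0.0)], "right": [(1.0, "T", 10.0)]},
}


def backup(Q, policy, s, a):
    """Expected one-step return of (s, a), following the policy afterwards."""
    total = 0.0
    for prob, s_next, reward in model[s][a]:
        next_value = Q[s_next][policy[s_next]] if s_next in model else 0.0
        total += prob * (reward + GAMMA * next_value)
    return total


def policy_iteration():
    # 1. Initialization
    policy = {s: "left" for s in model}
    Q = {s: {a: 0.0 for a in model[s]} for s in model}
    while True:
        # 2. Policy evaluation, with Delta accumulated as a sum
        while True:
            delta = 0.0
            for s in model:
                for a in model[s]:
                    old_q = Q[s][a]
                    Q[s][a] = backup(Q, policy, s, a)
                    delta += abs(old_q - Q[s][a])
            if delta < THETA:
                break
        # 3. Policy improvement
        policy_stable = True
        for s in model:
            old_action = policy[s]
            policy[s] = max(Q[s], key=Q[s].get)
            if old_action != policy[s]:
                policy_stable = False
        if policy_stable:
            return Q, policy


Q, policy = policy_iteration()
print(policy)  # {'A': 'right', 'B': 'right'}
```

On this example the action values converge to $Q(A,left)=1$, $Q(A,right)=9$, $Q(B,left)=8.1$, and $Q(B,right)=10$.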


## 6. Exercise 4.6


## 7. Exercise 4.7

## 8. Exercise 4.8

## 9. Exercise 4.9

## 10. Exercise 4.10
Value iteration update for action values, $q_{k+1}(s,a)$. The maximum is taken over the next action $a'$ inside the expectation, since $v_k(s')=\max_{a'}q_k(s',a')$:

$$
\begin{aligned}
q_{k+1}(s,a) &= E\big[R_{t+1}+\gamma \max_{a'}q_k(S_{t+1},a') | S_t=s,A_t=a\big]
\cr &= \sum_{s',r}p(s',r | s,a)\big[r+\gamma\max_{a'\in\mathcal A(s')}q_k(s', a')\big]
\end{aligned}
$$
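A minimal sketch of value iteration on action values, assuming the backup $q_{k+1}(s,a)=\sum_{s',r}p(s',r|s,a)\big[r+\gamma\max_{a'}q_k(s',a')\big]$ applied synchronously to every pair $(s,a)$. The two-state MDP `model`, the discount $\gamma=0.9$, and the function name are hypothetical and chosen only for illustration.

```python
GAMMA = 0.9
# Hypothetical MDP: model[s][a] is a list of (probability, next_state, reward).
# "T" is terminal; it does not appear as a key and has value 0.
model = {
    "A": {"left": [(1.0, "T", 1.0)], "right": [(1.0, "B", 0.0)]},
    "B": {"left": [(1.0, "A", 0.0)], "right": [(1.0, "T", 10.0)]},
}


def q_value_iteration(theta=1e-10):
    """Repeat the max-backup on q until no entry changes by theta or more."""
    Q = {s: {a: 0.0 for a in model[s]} for s in model}
    while True:
        delta = 0.0
        new_Q = {}
        for s in model:
            new_Q[s] = {}
            for a in model[s]:
                total = 0.0
                for prob, s_next, reward in model[s][a]:
                    # max over next actions replaces the sum over pi(a'|s')
                    best_next = max(Q[s_next].values()) if s_next in model else 0.0
                    total += prob * (reward + GAMMA * best_next)
                new_Q[s][a] = total
                delta = max(delta, abs(total - Q[s][a]))
        Q = new_Q  # synchronous update: q_{k+1} is built entirely from q_k
        if delta < theta:
            return Q


Q = q_value_iteration()
greedy = {s: max(Q[s], key=Q[s].get) for s in Q}
print(greedy)  # {'A': 'right', 'B': 'right'}
```

No policy appears in the update; a greedy policy is read off from the converged action values at the end.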