# Chapter 3: Finite Markov Decision Processes

## 1. Agent-Environment Interface
![agent-environment](assets/agent-env.png)

- The agent and the environment interact at each of a sequence of discrete time steps, $t=0,1,2,3,...$
- At each step $t$
    - agent receives the environment's **state** $S_t \in \mathcal S$
    - agent selects **action** $A_t \in \mathcal A(s)$
- One time step later
    - environment emits **reward** $R_{t+1}\in \mathcal R \subset \mathbb R$
    - environment emits **state** $S_{t+1}\in \mathcal S$
- Trajectory (history) $S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3,...$
- Finite MDP: $|\mathcal S|, |\mathcal A|, |\mathcal R|$ are finite
- **Dynamics of MDP**:
$$
\begin{aligned}
p(s^{\prime}, r | s,a) &= Pr\{S_t=s^{\prime}, R_t=r | S_{t-1}=s, A_{t-1}=a\}
\\
p(s^{\prime} | s,a) &= Pr\{S_t=s^{\prime} | S_{t-1}=s, A_{t-1}=a\} = \sum_{r\in\mathcal R} p(s^{\prime}, r | s,a)
\\
r(s,a) &= E\{R_{t+1} | S_t=s, A_t=a\} = \sum_{r\in\mathcal R} r \sum_{s^{\prime}\in\mathcal S} p(s^{\prime}, r | s,a)
\end{aligned}
$$
- **Markov Property**: The probability of each possible value of $S_t$ and $R_t$ depends only on the immediately preceding state and action, $S_{t-1}$ and $A_{t-1}$, not on earlier history
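
As a minimal sketch of the dynamics function above, the Python below stores $p(s^{\prime}, r | s,a)$ for a hypothetical two-state MDP as a dictionary and derives $p(s^{\prime} | s,a)$ and $r(s,a)$ from it by marginalizing. The state names, action names, and probabilities are invented for illustration and do not come from the chapter.

```python
from collections import defaultdict

# Hypothetical two-state MDP (all names and numbers are illustrative).
# dynamics[(s, a)] maps (s_next, r) -> p(s_next, r | s, a).
dynamics = {
    ("low", "wait"):    {("low", 0.0): 1.0},
    ("low", "search"):  {("low", 1.0): 0.5, ("high", -3.0): 0.5},
    ("high", "wait"):   {("high", 0.0): 1.0},
    ("high", "search"): {("high", 1.0): 0.75, ("low", 1.0): 0.25},
}


def state_transition_probs(s, a):
    """p(s' | s, a): sum p(s', r | s, a) over all rewards r."""
    probs = defaultdict(float)
    for (s_next, _r), p in dynamics[(s, a)].items():
        probs[s_next] += p
    return dict(probs)


def expected_reward(s, a):
    """r(s, a): sum of r * p(s', r | s, a) over all s' and r."""
    return sum(r * p for (_s_next, r), p in dynamics[(s, a)].items())


# Each conditional distribution p(., . | s, a) must sum to 1.
for key, dist in dynamics.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-12, key

print(state_transition_probs("high", "search"))
print(expected_reward("low", "search"))
```

Storing the full four-argument function and deriving the other two quantities from it mirrors the definitions above: $p(s^{\prime}, r | s,a)$ is the primitive, and everything else is a marginal or an expectation of it.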

## 2. Goals and Returns
- **Goal**: Maximize the expected value of the cumulative reward in the long run
- Maximize the *expected discounted return*:
$$G_t=R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+...=\sum_{k=t+1}^T \gamma^{k-t-1}R_k=R_{t+1}+\gamma G_{t+1}$$
  where $0\le\gamma \le 1$ is the *discount rate*
  - Either $T=\infty$ or $\gamma=1$ is allowed, but not both
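
The recursion $G_t=R_{t+1}+\gamma G_{t+1}$ suggests computing returns backwards from the end of an episode. The sketch below assumes a finite episode with $G_T=0$; the function name and the reward sequence are made up for illustration.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] given rewards [R_1, ..., R_T]."""
    returns = [0.0] * len(rewards)
    g = 0.0  # G_T = 0: nothing is collected after the episode ends
    for t in reversed(range(len(rewards))):
        # rewards[t] holds R_{t+1}, so this is G_t = R_{t+1} + gamma * G_{t+1}
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


rewards = [1.0, 0.0, 2.0]  # R_1, R_2, R_3 (illustrative values)
print(discounted_returns(rewards, gamma=0.5))  # [1.5, 1.0, 2.0]
```

Working backwards takes one pass over the episode, whereas evaluating the explicit sum separately for every $t$ would repeat most of the work.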
  

## 3. Policies and Value Functions
- Policy: mapping from states to probabilities of selecting each possible action
$$\pi(a | s) = Pr(A_t = a | S_t = s)$$
- State-Value function for policy $\pi$
$$v_{\pi}(s) = E_{\pi}[G_t | S_t=s]=E_{\pi}\Big[\sum_{k=0}^{\infty}\gamma^kR_{t+k+1} | S_t=s\Big], ~~~\forall s\in\mathcal S$$
- Action-Value function for policy $\pi$
$$q_{\pi}(s, a) = E_{\pi}[G_t | S_t=s, A_t=a]=E_{\pi}\Big[\sum_{k=0}^{\infty}\gamma^kR_{t+k+1} | S_t=s, A_t=a\Big]$$
- Bellman equation for $v_\pi$
$$
\begin{aligned}
v_\pi(s) &= E_{\pi}[G_t | S_t=s]
\\ &= E_{\pi}[R_{t+1}+\gamma G_{t+1} | S_t=s]
\\ &= \sum_a\pi(a | s)\sum_{s^\prime}\sum_r p(s^\prime, r | s,a)\Big[ r+\gamma E_\pi[G_{t+1} | S_{t+1}=s^\prime] \Big]
\\ &= \sum_a\pi(a | s)\sum_{s^\prime, r} p(s^\prime, r | s,a)\Big[ r+\gamma v_\pi(s^\prime) \Big]
\end{aligned}
$$
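
One way to see the Bellman equation at work is to treat its right-hand side as an update rule and apply it repeatedly until the values stop changing. The sketch below does this for a hypothetical two-state MDP under a uniform random policy. The MDP, the policy, $\gamma=0.5$, and the stopping tolerance are all assumptions chosen for illustration, and repeated substitution is just one possible way to solve the equation.

```python
# Hypothetical MDP: dynamics[(s, a)] maps (s_next, r) -> p(s_next, r | s, a).
dynamics = {
    ("A", "stay"): {("A", 0.0): 1.0},
    ("A", "go"):   {("B", 1.0): 1.0},
    ("B", "stay"): {("B", 2.0): 1.0},
    ("B", "go"):   {("A", 0.0): 1.0},
}
states = ["A", "B"]
actions = ["stay", "go"]
policy = {s: {"stay": 0.5, "go": 0.5} for s in states}  # pi(a | s)
gamma = 0.5


def bellman_backup(v, s):
    """Right-hand side of the Bellman equation for v_pi at state s."""
    total = 0.0
    for a in actions:
        for (s_next, r), p in dynamics[(s, a)].items():
            total += policy[s][a] * p * (r + gamma * v[s_next])
    return total


def evaluate_policy(tol=1e-12):
    """Apply the backup to every state until the largest change is below tol."""
    v = {s: 0.0 for s in states}
    while True:
        new_v = {s: bellman_backup(v, s) for s in states}
        if max(abs(new_v[s] - v[s]) for s in states) < tol:
            return new_v
        v = new_v


v_pi = evaluate_policy()
print(v_pi)  # approximately {'A': 1.25, 'B': 1.75}
```

The result can be checked by hand: substituting $v_\pi(A)=1.25$ and $v_\pi(B)=1.75$ into the last line of the derivation reproduces the same two numbers.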


## 4. Optimal Policies and Optimal Value Functions
- Solving the task means finding a policy that achieves high reward over the long run
- $\pi$ is better than or equal to $\pi^\prime$: $\pi \ge \pi^\prime \text{ iff } v_\pi(s) \ge v_{\pi^\prime}(s), ~~~\forall s\in\mathcal S$
- Optimal Policy: a policy $\pi_*$ that is better than or equal to every other policy: $$\pi_* \ge \pi, ~~~\forall \pi$$
- Optimal state-value function: $$v_*(s) = \max\limits_\pi v_\pi(s), ~~~\forall s\in\mathcal S$$
- Optimal action-value function: $$q_*(s, a) = \max\limits_\pi q_\pi(s, a)=E\Big[R_{t+1}+\gamma v_*(S_{t+1}) | S_t=s,A_t=a\Big], ~~~\forall s\in\mathcal S, a\in\mathcal A(s)$$

- Computing an optimal policy exactly is hard and computationally expensive, so in practice it can only be approximated to varying degrees
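
To make the definitions concrete, the sketch below brute-forces a hypothetical two-state MDP: it enumerates every deterministic policy, evaluates each one, and picks the policy whose value is at least as large as all others in every state. It relies on the fact that a finite MDP always has a deterministic optimal policy. The MDP, $\gamma=0.5$, and the fixed number of evaluation sweeps are assumptions for illustration. The number of deterministic policies is $|\mathcal A|^{|\mathcal S|}$, so this only works for toy problems, which is one reason approximation is needed in practice.

```python
from itertools import product

# Hypothetical MDP: dynamics[(s, a)] maps (s_next, r) -> p(s_next, r | s, a).
dynamics = {
    ("A", "stay"): {("A", 0.0): 1.0},
    ("A", "go"):   {("B", 1.0): 1.0},
    ("B", "stay"): {("B", 2.0): 1.0},
    ("B", "go"):   {("A", 0.0): 1.0},
}
states = ["A", "B"]
actions = ["stay", "go"]
gamma = 0.5


def backup(v, s, a):
    """Expected value of r + gamma * v(s') after taking action a in state s."""
    return sum(p * (r + gamma * v[s2]) for (s2, r), p in dynamics[(s, a)].items())


def evaluate(policy, sweeps=200):
    """Approximate v_pi for a deterministic policy given as {state: action}."""
    v = {s: 0.0 for s in states}
    for _ in range(sweeps):
        v = {s: backup(v, s, policy[s]) for s in states}
    return v


def dominates(v1, v2, eps=1e-9):
    """True if v1(s) >= v2(s) in every state, i.e. pi_1 >= pi_2."""
    return all(v1[s] >= v2[s] - eps for s in states)


# Every assignment of one action per state is a deterministic policy.
candidates = [dict(zip(states, choice)) for choice in product(actions, repeat=len(states))]
values = [evaluate(pi) for pi in candidates]

# The optimal policy is better than or equal to every candidate.
best = next(i for i, v in enumerate(values) if all(dominates(v, other) for other in values))
optimal_policy = candidates[best]
v_star = values[best]

# q_*(s, a) = E[R_{t+1} + gamma * v_*(S_{t+1}) | S_t = s, A_t = a]
q_star = {(s, a): backup(v_star, s, a) for s in states for a in actions}

print(optimal_policy)  # {'A': 'go', 'B': 'stay'}
print(v_star)          # approximately {'A': 3.0, 'B': 4.0}
```

In this example $v_*(s)$ equals $\max_a q_*(s,a)$ in both states, and the optimal policy picks the maximizing action in each.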