## [Dec 17] Markov Decision Process IV

Presenter: Yuchen Ge  
Affiliation: University of Oxford  
Contact Email: gycdwwd@gmail.com  
Website: https://yuchenge-am.github.io

Model first, then solve.
----

### 1. MDP Formulation

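An (**infinite-horizon**) MDP $M$ consists of a state space $\mathcal{S}$, an action space $\mathcal{A}$, a discount factor $\gamma \in[0,1)$, and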

> **transition** $P: \mathcal{S} \times \mathcal{A} \rightarrow \Delta(\mathcal{S})$ with $P(\cdot \mid s, a)$, **reward** $r: \mathcal{S} \times \mathcal{A} \rightarrow[0,1]$, and an **initial state distribution** $\mu \in \Delta(\mathcal{S})$.
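
For a finite (tabular) MDP, these objects can be stored as arrays. The following is a minimal sketch, assuming finite $\mathcal{S}$ and $\mathcal{A}$; the helper `make_random_mdp` and the array layout `P[s, a, s']` are illustrative choices, not part of the notes.

```python
import numpy as np

def make_random_mdp(num_states, num_actions, seed=0):
    """Sample a random tabular MDP (hypothetical helper, for illustration only)."""
    rng = np.random.default_rng(seed)
    # P[s, a, :] is the distribution P(. | s, a) over next states.
    P = rng.random((num_states, num_actions, num_states))
    P /= P.sum(axis=2, keepdims=True)
    # r[s, a] lies in [0, 1], matching the reward range in the text.
    r = rng.random((num_states, num_actions))
    # mu is the initial state distribution over states.
    mu = rng.random(num_states)
    mu /= mu.sum()
    return P, r, mu

P, r, mu = make_random_mdp(num_states=3, num_actions=2)
print(P.shape, r.shape, mu.shape)
```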

Let $\tau_{t}=\left(s_{0}, a_{0}, r_{0}, s_{1}, \ldots, s_{t}, a_{t}, r_{t}\right)$ denote a trajectory up to time $t$. Then

> a policy is a map $\pi: \mathcal{H} \rightarrow \Delta(\mathcal{A})$, where $\mathcal{H}$ is the set of all possible trajectories (of all lengths). In particular, a **stationary policy** is a map $\pi: \mathcal{S} \rightarrow \Delta(\mathcal{A})$, i.e. $a_{t} \sim \pi\left(\cdot \mid s_{t}\right)$, and a **deterministic policy** is a map $\pi: \mathcal{S} \rightarrow \mathcal{A}$.
>
> Set $V^{\pi}(s)=\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r\left(s_{t}, a_{t}\right) \mid \pi, s_{0}=s\right]$ and $Q^{\pi}(s, a)=\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r\left(s_{t}, a_{t}\right) \mid \pi, s_{0}=s, a_{0}=a\right]$, where $\gamma \in[0,1)$ is the discount factor.
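
Since rewards lie in $[0,1]$, both value functions are bounded whenever $\gamma \in[0,1)$ (the range assumed here). Summing the geometric series gives, for every policy $\pi$, state $s$, and action $a$:

$$0 \leq V^{\pi}(s) \leq \sum_{t=0}^{\infty} \gamma^{t}=\frac{1}{1-\gamma}, \qquad 0 \leq Q^{\pi}(s, a) \leq \frac{1}{1-\gamma}.$$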

The goal is to find a policy attaining $\max _{\pi} V^{\pi}(s)$ for a given starting state $s$. A stationary policy $\pi$ induces a transition matrix over state-action pairs, $P_{(s, a),\left(s^{\prime}, a^{\prime}\right)}^{\pi}:=P\left(s^{\prime} \mid s, a\right) \pi\left(a^{\prime} \mid s^{\prime}\right)$.

> For a stationary $\pi$,
> 
> $$ \begin{aligned} V^{\pi}(s) & = \mathbb{E}_{a \sim \pi(\cdot \mid s)} Q^{\pi}(s, a) . \\ Q^{\pi}(s, a) & =r(s, a)+\gamma \mathbb{E}_{s^{\prime} \sim P(\cdot \mid s, a)}\left[V^{\pi}\left(s^{\prime}\right)\right] \end{aligned} \implies Q^{\pi} = r+\gamma P V^{\pi} = r+\gamma P^{\pi} Q^{\pi}.$$
> 
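> Here we view $V^{\pi}$ as a vector of length $|\mathcal{S}|$, $Q^{\pi}$ and $r$ as vectors of length $|\mathcal{S}| \cdot|\mathcal{A}|$, $P$ as a matrix of size $(|\mathcal{S}| \cdot|\mathcal{A}|) \times|\mathcal{S}|$, and $P^{\pi}$ as a matrix of size $(|\mathcal{S}| \cdot|\mathcal{A}|) \times(|\mathcal{S}| \cdot|\mathcal{A}|)$.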
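
To make the shapes concrete, the sketch below builds $P$ and $P^{\pi}$ for a small random MDP. The sizes, the row-major flattening of $(s, a)$ pairs, and the auxiliary matrix `Pi` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 3, 2  # small illustrative sizes

# P_sas[s, a, s'] = P(s' | s, a); pi[s, a] = pi(a | s).
P_sas = rng.random((S, A, S))
P_sas /= P_sas.sum(axis=2, keepdims=True)
pi = rng.random((S, A))
pi /= pi.sum(axis=1, keepdims=True)

# Flatten (s, a) pairs in row-major order: index = s * A + a.
P = P_sas.reshape(S * A, S)  # shape (|S||A|, |S|)

# Pi[s', (s', a')] = pi(a' | s') and zero elsewhere; shape (|S|, |S||A|).
Pi = np.zeros((S, S * A))
for s in range(S):
    Pi[s, s * A:(s + 1) * A] = pi[s]

# P_pi[(s, a), (s', a')] = P(s' | s, a) * pi(a' | s').
P_pi = P @ Pi  # shape (|S||A|, |S||A|)
print(P.shape, P_pi.shape)
```

Each row of `P_pi` sums to one, so it is a valid transition matrix over state-action pairs.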








If $I-\gamma P^{\pi}$ is invertible, then $Q^{\pi}=\left(I-\gamma P^{\pi}\right)^{-1} r$. Invertibility holds because, for $\gamma<1$ and any $x \neq 0$,

$$\begin{aligned}
\left\|\left(I-\gamma P^{\pi}\right) x\right\|_{\infty} & =\left\|x-\gamma P^{\pi} x\right\|_{\infty} \\
& \geq\|x\|_{\infty}-\gamma\left\|P^{\pi} x\right\|_{\infty} && \text{(triangle inequality)} \\
& \geq\|x\|_{\infty}-\gamma\|x\|_{\infty} && \text{(each element of } P^{\pi} x \text{ is an average of } x \text{)} \\
& =(1-\gamma)\|x\|_{\infty}>0 && (\gamma<1,\ x \neq 0)
\end{aligned}$$

which implies that $I-\gamma P^{\pi}$ has a trivial null space and is therefore full rank.
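
The closed form $Q^{\pi}=\left(I-\gamma P^{\pi}\right)^{-1} r$ can be checked numerically. The following is a minimal sketch for a random tabular MDP; the function name `evaluate_policy`, the array layout, and the choice $\gamma=0.9$ are assumptions for illustration. It solves the linear system rather than forming the inverse explicitly.

```python
import numpy as np

def evaluate_policy(P_sas, r, pi, gamma):
    """Return (Q, V) for a stationary policy by solving (I - gamma P^pi) Q = r."""
    S, A = r.shape
    P = P_sas.reshape(S * A, S)
    # P_pi[(s, a), (s', a')] = P(s' | s, a) * pi(a' | s')
    P_pi = (P[:, :, None] * pi[None, :, :]).reshape(S * A, S * A)
    Q = np.linalg.solve(np.eye(S * A) - gamma * P_pi, r.reshape(S * A))
    Q = Q.reshape(S, A)
    # V(s) = E_{a ~ pi(. | s)} Q(s, a)
    V = (pi * Q).sum(axis=1)
    return Q, V

rng = np.random.default_rng(1)
S, A, gamma = 4, 3, 0.9
P_sas = rng.random((S, A, S))
P_sas /= P_sas.sum(axis=2, keepdims=True)
r = rng.random((S, A))
pi = rng.random((S, A))
pi /= pi.sum(axis=1, keepdims=True)

Q, V = evaluate_policy(P_sas, r, pi, gamma)
# Bellman consistency check: Q = r + gamma * P V
bellman_residual = np.abs(Q - (r + gamma * P_sas @ V)).max()
print(bellman_residual)
```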

> Let $\Pi$ be the set of all policies, including non-stationary and randomized ones, and define $V^{\star}(s) :=\sup _{\pi \in \Pi} V^{\pi}(s)$ and $Q^{\star}(s, a) :=\sup _{\pi \in \Pi} Q^{\pi}(s, a)$. Then there exists a stationary and deterministic policy $\pi$ such that $V^{\pi}(s) = V^{\star}(s)$ and $Q^{\pi}(s, a) =Q^{\star}(s, a)$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$.


We refer to such a $\pi$ as an optimal policy.
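
One consequence of this result is that a single deterministic stationary policy is best in every state simultaneously. The sketch below illustrates this on a tiny random MDP by enumerating all $|\mathcal{A}|^{|\mathcal{S}|}$ deterministic stationary policies. Note the limitation: it only compares deterministic stationary policies with each other, not against all of $\Pi$, and the helper name and sizes are assumptions for illustration.

```python
import itertools
import numpy as np

def value_of_deterministic_policy(P_sas, r, actions, gamma):
    """V^pi for the deterministic policy s -> actions[s], via a linear solve."""
    S = r.shape[0]
    idx = np.arange(S)
    P_pi = P_sas[idx, actions]  # (S, S): row s is P(. | s, actions[s])
    r_pi = r[idx, actions]      # (S,): reward of the chosen action in each state
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

rng = np.random.default_rng(2)
S, A, gamma = 3, 2, 0.9
P_sas = rng.random((S, A, S))
P_sas /= P_sas.sum(axis=2, keepdims=True)
r = rng.random((S, A))

policies = list(itertools.product(range(A), repeat=S))
values = np.array([
    value_of_deterministic_policy(P_sas, r, np.array(p), gamma) for p in policies
])

# Best achievable value in each state, taken separately over all policies.
best_per_state = values.max(axis=0)
# Policies that attain the best value in every state at once.
dominating = [p for p, v in zip(policies, values) if np.allclose(v, best_per_state)]
print(dominating)
```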


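The Bellman optimality operator $\mathcal{T}_{M}: \mathbb{R}^{|\mathcal{S}||\mathcal{A}|} \rightarrow \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ (written $\mathcal{T}$ when $M$ is clear from context) is defined as

$$\mathcal{T} Q:=r+\gamma P V_{Q}, \qquad V_{Q}(s):=\max _{a \in \mathcal{A}} Q(s, a).$$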
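
The operator can be applied directly to a tabular $Q$. The sketch below takes $V_{Q}(s)=\max_{a} Q(s, a)$ (the usual convention, assumed here) and applies $\mathcal{T}$ repeatedly starting from zero. This iteration is standard value iteration, which these notes do not cover; it relies on the well-known fact that $\mathcal{T}$ is a $\gamma$-contraction in the sup norm. The function name, sizes, and iteration count are illustrative assumptions.

```python
import numpy as np

def bellman_optimality_operator(Q, P_sas, r, gamma):
    """Apply (T Q)(s, a) = r(s, a) + gamma * E_{s' ~ P(. | s, a)}[max_a' Q(s', a')]."""
    V_Q = Q.max(axis=1)  # V_Q(s) = max_a Q(s, a)
    return r + gamma * P_sas @ V_Q

rng = np.random.default_rng(3)
S, A, gamma = 4, 3, 0.9
P_sas = rng.random((S, A, S))
P_sas /= P_sas.sum(axis=2, keepdims=True)
r = rng.random((S, A))

# Repeatedly apply T; the iterates approach a fixed point of T.
Q = np.zeros((S, A))
for _ in range(500):
    Q = bellman_optimality_operator(Q, P_sas, r, gamma)

fixed_point_gap = np.abs(bellman_optimality_operator(Q, P_sas, r, gamma) - Q).max()
greedy_policy = Q.argmax(axis=1)  # deterministic policy that is greedy w.r.t. Q
print(fixed_point_gap, greedy_policy)
```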


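For a general (history-dependent) policy $\pi$, conditioning on the first transition gives the following identity, where $\pi_{(s, a, r)}$ denotes the policy obtained from $\pi$ by conditioning on the history prefix $(s, a, r)$:

$$\mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t} r\left(s_{t}, a_{t}\right) \mid \pi,\left(S_{0}, A_{0}, R_{0}, S_{1}\right)=\left(s, a, r, s^{\prime}\right)\right]=\gamma \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r\left(s_{t}, a_{t}\right) \mid \pi_{(s, a, r)}, S_{0}=s^{\prime}\right]=\gamma V^{\pi_{(s, a, r)}}\left(s^{\prime}\right)$$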

A **finite-horizon (time-dependent)** MDP $M$ differs from the infinite-horizon one above only in that the transition $P_{h}$ and reward $r_{h}$ are indexed by the time step $h \in\{0, \ldots, H-1\}$.
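
With time-indexed $P_{h}$ and $r_{h}$, optimal values can be computed by working backward from the last step. The sketch below uses backward induction, a standard technique that these notes do not describe. It assumes an undiscounted objective $\sum_{h=0}^{H-1} r_{h}$ and zero value after step $H-1$; the function name and array layout are also illustrative assumptions.

```python
import numpy as np

def backward_induction(P_h, r_h):
    """P_h: (H, S, A, S), r_h: (H, S, A). Returns Q (H, S, A), V (H+1, S), policy (H, S)."""
    H, S, A = r_h.shape
    V = np.zeros((H + 1, S))  # V[H] = 0: no reward after the last step (assumption)
    Q = np.zeros((H, S, A))
    for h in reversed(range(H)):
        # Q_h(s, a) = r_h(s, a) + E_{s' ~ P_h(. | s, a)}[V_{h+1}(s')]
        Q[h] = r_h[h] + P_h[h] @ V[h + 1]
        V[h] = Q[h].max(axis=1)
    return Q, V, Q.argmax(axis=2)

rng = np.random.default_rng(4)
H, S, A = 5, 3, 2
P_h = rng.random((H, S, A, S))
P_h /= P_h.sum(axis=3, keepdims=True)
r_h = rng.random((H, S, A))

Q, V, policy = backward_induction(P_h, r_h)
print(V[0])
```

The returned `policy[h, s]` is a time-dependent deterministic policy, which is why the finite-horizon setting generally needs one decision rule per step $h$.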


---

### Reference

1. Alekh Agarwal. Reinforcement Learning: Theory and Algorithms.
2. Warren B. Powell. Reinforcement Learning and Stochastic Optimization.