## [Dec 17] Markov Decision Process IV

Presenter: Yuchen Ge  
Affiliation: University of Oxford  
Contact Email: gycdwwd@gmail.com  
Website: https://yuchenge-am.github.io

### 1. MDP Formulation

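An ( **infinite-horizon** ) MDP $M$ with state space $\mathcal{S}$ and action space $\mathcal{A}$ consists of

> a **transition** function $P: \mathcal{S} \times \mathcal{A} \rightarrow \Delta(\mathcal{S})$, where $P(\cdot \mid s, a)$ is the next-state distribution after taking action $a$ in state $s$; a **reward** function $r: \mathcal{S} \times \mathcal{A} \rightarrow[0,1]$; and an **initial state distribution** $\mu \in \Delta(\mathcal{S})$.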

Let $\tau_{t}=\left(s_{0}, a_{0}, r_{0}, s_{1}, \ldots, s_{t}, a_{t}, r_{t}\right)$ denote a trajectory. Then

> a policy is $\pi: \mathcal{H} \rightarrow \Delta(\mathcal{A})$ where $\mathcal{H}$ is the set of all possible trajectories (of all lengths). In particular, a **stationary policy** is $\pi: \mathcal{S} \rightarrow \Delta(\mathcal{A})$, i.e.  $a_{t} \sim \pi\left(\cdot \mid s_{t}\right)$, and a **deterministic policy** is $\pi: \mathcal{S} \rightarrow \mathcal{A}$.
>
> Set $V^{\pi}(s)=\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r\left(s_{t}, a_{t}\right) \mid \pi, s_{0}=s\right]$ and $Q^{\pi}(s, a)=\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r\left(s_{t}, a_{t}\right) \mid \pi, s_{0}=s, a_{0}=a\right]$ where $\gamma \in[0,1)$ is the discount factor.

The goal is to find a policy that attains $\max _{\pi} V^{\pi}(s)$. A stationary policy $\pi$ induces a transition matrix over state-action pairs, $P_{(s, a),\left(s^{\prime}, a^{\prime}\right)}^{\pi}:=P\left(s^{\prime} \mid s, a\right) \pi\left(a^{\prime} \mid s^{\prime}\right)$.

> For a stationary $\pi$,
> 
> $$ \begin{aligned} V^{\pi}(s) & = \mathbb{E}_{a \sim \pi(\cdot \mid s)} Q^{\pi}(s, a) . \\ Q^{\pi}(s, a) & =r(s, a)+\gamma \mathbb{E}_{s^{\prime} \sim P(\cdot \mid s, a)}\left[V^{\pi}\left(s^{\prime}\right)\right] \end{aligned} \implies Q^{\pi} = r+\gamma P V^{\pi} = r+\gamma P^{\pi} Q^{\pi}.$$
> 
> Here we view  $V^{\pi}$  as a vector of length  $|\mathcal{S}|$, $Q^{\pi}$  and  $r$  as vectors of length  $|\mathcal{S}| \cdot|\mathcal{A}|$, $P$  as a matrix of size  $(|\mathcal{S}| \cdot|\mathcal{A}|) \times|\mathcal{S}|$, and $P^\pi$ as a matrix of size $(|\mathcal{S}| \cdot|\mathcal{A}|) \times(|\mathcal{S}| \cdot|\mathcal{A}|)$.


If $I-\gamma P^{\pi}$ is invertible, then $Q^{\pi}=\left(I-\gamma P^{\pi}\right)^{-1} r$. Invertibility holds because, for $\gamma<1$ and any $0 \neq x \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$,

$$\begin{aligned}
\left\|\left(I-\gamma P^{\pi}\right) x\right\|_{\infty} & =\left\|x-\gamma P^{\pi} x\right\|_{\infty} \\
& \geq\|x\|_{\infty}-\gamma\left\|P^{\pi} x\right\|_{\infty} \\
&  \geq\|x\|_{\infty}-\gamma\|x\|_{\infty} & \text { (each element of } P^{\pi} x \text { is an average of } x \text { ) } \\
& =(1-\gamma)\|x\|_{\infty}>0 & (\gamma<1, x \neq 0)
\end{aligned}$$

which implies  $I-\gamma P^{\pi}$  is full rank.
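
The closed form $Q^{\pi}=\left(I-\gamma P^{\pi}\right)^{-1} r$ can be checked numerically. The following is a minimal sketch, assuming a tabular MDP stored as NumPy arrays; the function name `policy_evaluation`, the array layout, and the random 2-state MDP are illustrative assumptions, not part of the notes.

```python
import numpy as np

def policy_evaluation(P, r, pi, gamma):
    """Solve Q^pi = (I - gamma P^pi)^{-1} r for a stationary policy.

    P:  (S, A, S) array, P[s, a, s'] = P(s' | s, a)
    r:  (S, A) array of rewards in [0, 1]
    pi: (S, A) array, pi[s, a] = pi(a | s)
    """
    S, A = r.shape
    # P^pi[(s, a), (s', a')] = P(s' | s, a) * pi(a' | s')
    P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
    # Solve the linear system instead of forming the inverse explicitly.
    Q = np.linalg.solve(np.eye(S * A) - gamma * P_pi, r.reshape(S * A))
    return Q.reshape(S, A)

# Hypothetical 2-state, 2-action MDP, used only for illustration.
rng = np.random.default_rng(0)
S, A, gamma = 2, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # shape (S, A, S), rows sum to 1
r = rng.uniform(size=(S, A))
pi = np.full((S, A), 1.0 / A)                # uniformly random policy

Q = policy_evaluation(P, r, pi, gamma)
V = (pi * Q).sum(axis=1)                     # V^pi(s) = E_{a ~ pi} Q^pi(s, a)
```

The returned `Q` satisfies $Q^{\pi}=r+\gamma P V^{\pi}$ up to floating-point error, which is the Bellman consistency equation stated above.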

> Define  $\Pi$ as the set of all (possibly non-stationary and randomized) policies, $V^{\star}(s) :=\sup _{\pi \in \Pi} V^{\pi}(s)$ and $Q^{\star}(s, a) :=\sup _{\pi \in \Pi} Q^{\pi}(s, a)$. Then there exists a stationary and deterministic policy  $\pi$  s.t. $V^{\pi}(s) = V^{\star}(s)$ and $Q^{\pi}(s, a) =Q^{\star}(s, a)$ for all $s$ and $a$.

**Proof.** Simple computation shows that 

$$\begin{aligned}
V^{\star}\left(s_{0}\right) & = \sup _{\pi \in \Pi} \mathbb{E}\left[r\left(s_{0}, a_{0}\right)+\mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t} r\left(s_{t}, a_{t}\right) \mid \pi,\left(S_{0}, A_{0}, R_{0}, S_{1}\right)=\left(s_{0}, a_{0}, r_{0}, s_{1}\right)\right]\right] \\
& \leq \sup _{\pi \in \Pi} \mathbb{E}\left[r\left(s_{0}, a_{0}\right)+\sup _{\pi^{\prime} \in \Pi} \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t} r\left(s_{t}, a_{t}\right) \mid \pi^{\prime},\left(S_{0}, A_{0}, R_{0}, S_{1}\right)=\left(s_{0}, a_{0}, r_{0}, s_{1}\right)\right]\right] \\
& = \sup _{\pi \in \Pi} \mathbb{E}\left[r\left(s_{0}, a_{0}\right)+\gamma V^{\star}\left(s_{1}\right)\right] \\
& =\sup _{a_{0} \in \mathcal{A}} \mathbb{E}\left[r\left(s_{0}, a_{0}\right)+\gamma V^{\star}\left(s_{1}\right)\right].
\end{aligned}$$

where the first equality uses the law of iterated expectations. Define

$$\pi^{\star}(s)=\operatorname{argmax}_{a \in \mathcal{A}}\mathbb{E}\left[r(s, a)+\gamma V^{\star}\left(s_{1}\right) \mid\left(S_{0}, A_{0}\right)=(s, a)\right],$$

i.e. $\pi^{\star}=\pi_{Q^{\star}}$ where $\pi_{Q}(s):=\operatorname{argmax}_{a \in \mathcal{A}} Q(s, a)$. And we have

$$V^{\star}\left(s_{0}\right) \leq \mathbb{E}\left[r\left(s_{0}, a_{0}\right)+\gamma V^{\star}\left(s_{1}\right) \mid \pi^{\star}\right] \leq \mathbb{E}\left[r\left(s_{0}, a_{0}\right)+\gamma r\left(s_{1}, a_{1}\right)+\gamma^{2} V^{\star}\left(s_{2}\right) \mid \pi^{\star}\right] \leq \ldots \leq V^{\pi^{\star}}\left(s_{0}\right)
$$
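Since $V^{\pi^{\star}}\left(s_{0}\right) \leq V^{\star}\left(s_{0}\right)$ by the definition of $V^{\star}$, equality holds throughout, which finishes the proof. We refer to such a $\pi^{\star}$ as an optimal policy.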

This shows that we may restrict ourselves to using stationary and deterministic policies without any loss in performance. 

> With $V_{Q}(s):=\max _{a \in \mathcal{A}} Q(s, a)$, the Bellman optimality operator $\mathcal{T}: \mathbb{R}^{|\mathcal{S}||\mathcal{A}|} \rightarrow \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ is defined as:
>
> $$\mathcal{T} Q:=r+\gamma P V_{Q}.$$ 

Then we have the famous

> ( **Bellman optimality equations** ) $Q=Q^{\star}$ iff $Q=\mathcal{T}Q$.


**Proof.** We first show that $Q^{\star}$ satisfies $Q^{\star}=\mathcal{T} Q^{\star}$. Observing that $V^{\star}(s)=\max _{a} Q^{\star}(s, a)$, we have

$$\begin{aligned}
Q^{\star}(s, a) & =\max _{\pi} Q^{\pi}(s, a)=r(s, a)+\gamma \max _{\pi} \mathbb{E}_{s^{\prime} \sim P(\cdot \mid s, a)}\left[V^{\pi}\left(s^{\prime}\right)\right] \\
& =r(s, a)+\gamma \mathbb{E}_{s^{\prime} \sim P(\cdot \mid s, a)}\left[V^{\star}\left(s^{\prime}\right)\right] \\
& =r(s, a)+\gamma \mathbb{E}_{s^{\prime} \sim P(\cdot \mid s, a)}\left[\max _{a^{\prime}} Q^{\star}\left(s^{\prime}, a^{\prime}\right)\right].
\end{aligned}$$

Conversely, suppose $Q=\mathcal{T} Q$; we show that $Q=Q^{\star}$. Let $\pi=\pi_{Q}$. Then we have

$$  Q = \mathcal{T} Q = r+\gamma P^{\pi} Q \implies Q=\left(I-\gamma P^{\pi}\right)^{-1} r=Q^{\pi},$$

so $Q$ is the action-value function of the policy $\pi_Q$. Therefore, for any other deterministic and stationary policy $\pi^{\prime}$,

$$\begin{aligned}
Q-Q^{\pi^{\prime}} & =Q^{\pi}-Q^{\pi^{\prime}} =Q^{\pi}-\left(I-\gamma P^{\pi^{\prime}}\right)^{-1} r \\
& =\left(I-\gamma P^{\pi^{\prime}}\right)^{-1}\left(\left(I-\gamma P^{\pi^{\prime}}\right)-\left(I-\gamma P^{\pi}\right)\right) Q^{\pi} \\
& =\gamma\left(I-\gamma P^{\pi^{\prime}}\right)^{-1}\left(P^{\pi}-P^{\pi^{\prime}}\right) Q^{\pi} \geq 0
\end{aligned}$$


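where the final inequality holds because $\left(I-\gamma P^{\pi^{\prime}}\right)^{-1}$ has non-negative entries and $\pi=\pi_{Q}$ is greedy with respect to $Q=Q^{\pi}$. In detail, for any stationary policy,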

> $\left[\left(I-\gamma P^{\pi}\right)^{-1}\right]_{(s, a),\left(s^{\prime}, a^{\prime}\right)}= \sum_{t=0}^{\infty} \gamma^{t} \mathbb{P}^{\pi}\left(s_{t}=s^{\prime}, a_{t}=a^{\prime} \mid s_{0}=s, a_{0}=a\right)$;
>
> $\left[\left(P^{\pi}-P^{\pi^{\prime}}\right) Q^{\pi}\right]_{s, a}=\mathbb{E}_{s^{\prime} \sim P(\cdot \mid s, a)}\left[Q^{\pi}\left(s^{\prime}, \pi\left(s^{\prime}\right)\right)-Q^{\pi}\left(s^{\prime}, \pi^{\prime}\left(s^{\prime}\right)\right)\right] \geq 0$.

Thus,  $Q^{\pi}=Q \geq Q^{\pi^{\prime}}$  for all deterministic and stationary  $\pi^{\prime}$, which shows that  $\pi$  is an optimal policy. This completes the proof.

---

### 2. Finite-horizon MDP 

A **finite-horizon ( and time-dependent )** MDP $M$ differs from the infinite-horizon one only in that $P$ and $r$ are replaced by the time-dependent $P_{h}$ and $r_h$, $h \in\{0, \ldots, H-1\}$, and that there is no discounting.

>  $V_{h}^{\pi}(s)=\mathbb{E}\left[\sum_{t=h}^{H-1} r_{t}\left(s_{t}, a_{t}\right) \mid \pi, s_{h}=s\right]$;
>
> $Q_{h}^{\pi}(s, a)=\mathbb{E}\left[\sum_{t=h}^{H-1} r_{t}\left(s_{t}, a_{t}\right) \mid \pi, s_{h}=s, a_{h}=a\right]$.

Correspondingly, we have the following ( the proof reduces to the infinite-horizon case by treating $s_t, a_t$ as trivial for $t \geq H$ ).

> ( **Bellman optimality equations** ) Define $Q_{h}^{\star}(s, a)=\sup _{\pi \in \Pi} Q_{h}^{\pi}(s, a)$ where the sup is over all non-stationary and randomized policies, and write $[H]:=\{0, \ldots, H-1\}$. Suppose that  $Q_{H}=0$. We have that  $Q_{h}=Q_{h}^{\star}$  for all  $h \in[H]$  iff for all  $h \in[H]$,
>
> $$Q_{h}(s, a)=r_{h}(s, a)+\mathbb{E}_{s^{\prime} \sim P_{h}(\cdot \mid s, a)}\left[\max _{a^{\prime} \in \mathcal{A}} Q_{h+1}\left(s^{\prime}, a^{\prime}\right)\right].$$


Furthermore,  $\pi(s, h)=\operatorname{argmax}_{a \in \mathcal{A}} Q_{h}^{\star}(s, a)$  is an optimal policy.
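
The finite-horizon Bellman optimality equations can be solved by backward induction from $Q_{H}=0$. The following is a minimal sketch, assuming time-dependent tables stored as NumPy arrays; the function name `backward_induction`, the array shapes, and the random instance are illustrative assumptions.

```python
import numpy as np

def backward_induction(P, r):
    """Finite-horizon Bellman optimality recursion.

    P: (H, S, A, S) array, P[h, s, a, s'] = P_h(s' | s, a)
    r: (H, S, A) array, r[h, s, a] = r_h(s, a)
    Returns Q of shape (H, S, A) and the greedy policy pi of shape (H, S).
    """
    H, S, A = r.shape
    Q = np.zeros((H + 1, S, A))            # Q[H] = 0 is the terminal condition
    for h in range(H - 1, -1, -1):
        V_next = Q[h + 1].max(axis=1)      # max_{a'} Q_{h+1}(s', a')
        Q[h] = r[h] + P[h] @ V_next        # no discount factor in this setting
    return Q[:H], Q[:H].argmax(axis=2)

# Hypothetical random instance, for illustration only.
rng = np.random.default_rng(0)
H, S, A = 3, 2, 2
P = rng.dirichlet(np.ones(S), size=(H, S, A))   # shape (H, S, A, S)
r = rng.uniform(size=(H, S, A))

Q, pi = backward_induction(P, r)
```

Here `pi[h, s]` plays the role of $\pi(s, h)=\operatorname{argmax}_{a} Q_{h}^{\star}(s, a)$; the policy depends on the time step even though it is deterministic.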

---

### 3. Computational Complexity

We use the following conventions.

> Let $L(P, r, \gamma)$ denote the total bit-size required to specify  $M$, where $(P, r, \gamma)$ in $M$ is specified with rational entries.
>
> We also assume that the basic arithmetic operations $+, -, \times, \div$ take unit time.

Then we study the value iteration method.

> **Lemma I.** $ \left\|\mathcal{T} Q-\mathcal{T} Q^{\prime}\right\|_{\infty} \leq \gamma\left\|Q-Q^{\prime}\right\|_{\infty}. $
>
> **Lemma II.** For any $Q \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$, $V^{\pi_{Q}} \geq V^{\star}-\frac{2\left\|Q-Q^{\star}\right\|_{\infty}}{1-\gamma} \mathbb{1}.$

**Proof.** For the first, simple computation shows that 

$$\begin{aligned}
\left\|\mathcal{T} Q-\mathcal{T} Q^{\prime}\right\|_{\infty} & =\gamma\left\|P V_{Q}-P V_{Q^{\prime}}\right\|_{\infty} =\gamma\left\|P\left(V_{Q}-V_{Q^{\prime}}\right)\right\|_{\infty} \leq \gamma\left\|V_{Q}-V_{Q^{\prime}}\right\|_{\infty} \\
& = \gamma \max _{s}\left|V_{Q}(s)-V_{Q^{\prime}}(s)\right| \leq \gamma \max _{s} \max _{a}\left|Q(s, a)-Q^{\prime}(s, a)\right| =\gamma\left\|Q-Q^{\prime}\right\|_{\infty}.
\end{aligned}$$

For the second, fix a state $s$ and let $a=\pi_{Q}(s)$. Then

$$\begin{aligned}
V^{\star}(s)-V^{\pi_{Q}}(s)= & Q^{\star}\left(s, \pi^{\star}(s)\right)-Q^{\pi_{Q}}(s, a) \\
= & Q^{\star}\left(s, \pi^{\star}(s)\right)-Q^{\star}(s, a)+Q^{\star}(s, a)-Q^{\pi_{Q}}(s, a) \\
= & Q^{\star}\left(s, \pi^{\star}(s)\right)-Q^{\star}(s, a)+\gamma \mathbb{E}_{s^{\prime} \sim P(\cdot \mid s, a)}\left[V^{\star}\left(s^{\prime}\right)-V^{\pi_{Q}}\left(s^{\prime}\right)\right] \\
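\leq & Q^{\star}\left(s, \pi^{\star}(s)\right)-Q\left(s, \pi^{\star}(s)\right)+Q(s, a)-Q^{\star}(s, a) +\gamma \mathbb{E}_{s^{\prime} \sim P(\cdot \mid s, a)}\left[V^{\star}\left(s^{\prime}\right)-V^{\pi_{Q}}\left(s^{\prime}\right)\right] \\
\leq & 2\left\|Q-Q^{\star}\right\|_{\infty}+\gamma\left\|V^{\star}-V^{\pi_{Q}}\right\|_{\infty} .
\end{aligned}$$

Here the first inequality uses $Q\left(s, \pi^{\star}(s)\right) \leq Q(s, a)$, which holds because $a=\pi_{Q}(s)$ is greedy with respect to $Q$. Taking the maximum over $s$ and rearranging gives $\left\|V^{\star}-V^{\pi_{Q}}\right\|_{\infty} \leq \frac{2\left\|Q-Q^{\star}\right\|_{\infty}}{1-\gamma}$, which is Lemma II.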

Therefore, we have

> ( **Q-value iteration convergence** ) Set  $Q^{(0)}=0$. Suppose $Q^{(k+1)}=\mathcal{T} Q^{(k)}$ and $\pi^{(k)}=\pi_{Q^{(k)}}$. Then for any $\epsilon>0$ and all sufficiently large $k$,
>
> $$V^{\pi^{(k)}} \geq V^{\star}-\epsilon \mathbb{1}$$

**Proof.** Since  $\left\|Q^{\star}\right\|_{\infty} \leq 1 /(1-\gamma)$, $Q^{(k)}=\mathcal{T}^{k} Q^{(0)}$  and $Q^{\star}=\mathcal{T} Q^{\star}$, Lemma I gives

$$\left\|Q^{(k)}-Q^{\star}\right\|_{\infty}=\left\|\mathcal{T}^{k} Q^{(0)}-\mathcal{T}^{k} Q^{\star}\right\|_{\infty} \leq \gamma^{k}\left\|Q^{(0)}-Q^{\star}\right\|_{\infty}=(1-(1-\gamma))^{k}\left\|Q^{\star}\right\|_{\infty} \leq \frac{\exp (-(1-\gamma) k)}{1-\gamma},$$

which, combined with Lemma II, yields the conclusion: any $k \geq \frac{\log \frac{2}{(1-\gamma)^{2} \epsilon}}{1-\gamma}$ suffices.

Regarding the iteration complexity for an exact solution: once the gap between the current objective value and the optimal objective value is smaller than  $2^{-L(P, r, \gamma)}$, the greedy policy is optimal. It follows that

> The value iteration method has computational complexity $|\mathcal{S}|^{2}|\mathcal{A}| \frac{L(P, r, \gamma) \log \frac{1}{1-\gamma}}{1-\gamma}$. ( universal constants are dropped )
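
The Q-value iteration above can be sketched in a few lines. This is an illustrative implementation under assumed conventions (tabular arrays, the hypothetical names `bellman_optimality` and `q_value_iteration`, a random 3-state MDP); the iteration count is the bound $k \geq \frac{\log \frac{2}{(1-\gamma)^{2} \epsilon}}{1-\gamma}$ from the proof.

```python
import math
import numpy as np

def bellman_optimality(Q, P, r, gamma):
    """(T Q)(s, a) = r(s, a) + gamma * E_{s' ~ P(.|s,a)}[max_a' Q(s', a')]."""
    return r + gamma * P @ Q.max(axis=1)

def q_value_iteration(P, r, gamma, eps):
    """Iterate Q <- T Q from Q = 0 for the number of steps given in the text."""
    k = math.ceil(math.log(2.0 / ((1.0 - gamma) ** 2 * eps)) / (1.0 - gamma))
    Q = np.zeros_like(r)
    for _ in range(k):
        Q = bellman_optimality(Q, P, r, gamma)
    return Q, Q.argmax(axis=1), k

# Hypothetical random MDP, for illustration only.
rng = np.random.default_rng(0)
S, A, gamma, eps = 3, 2, 0.9, 0.1
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s'] = P(s' | s, a)
r = rng.uniform(size=(S, A))                 # rewards in [0, 1]

Q, pi, k = q_value_iteration(P, r, gamma, eps)
print(f"{k} iterations, greedy policy {pi}")
```

Each application of `bellman_optimality` touches every $(s, a, s^{\prime})$ triple once, which is where the $|\mathcal{S}|^{2}|\mathcal{A}|$ factor in the complexity bound comes from.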

We shall also consider the following policy iteration algorithm.

> 1. **Policy evaluation.** Compute  $Q^{\pi_{k}}$; ( use $Q^{\pi}=\left(I-\gamma P^{\pi}\right)^{-1} r$ )
> 2. **Policy improvement.** Update the policy: $\pi_{k+1}=\pi_{Q^{\pi_{k}}}$.
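
The two steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the names `evaluate` and `policy_iteration`, the stopping rule (stop when the greedy policy no longer changes), and the random MDP are assumptions. Policy evaluation solves the equivalent $|\mathcal{S}| \times|\mathcal{S}|$ system for $V^{\pi}$ and then sets $Q^{\pi}=r+\gamma P V^{\pi}$, instead of inverting the larger matrix $I-\gamma P^{\pi}$.

```python
import numpy as np

def evaluate(P, r, pi, gamma):
    """Q^pi for a deterministic stationary policy pi (array of S action indices)."""
    S = r.shape[0]
    idx = np.arange(S)
    # V^pi solves (I - gamma * P_pi) V = r_pi with P_pi[s, s'] = P(s' | s, pi(s)).
    V = np.linalg.solve(np.eye(S) - gamma * P[idx, pi], r[idx, pi])
    return r + gamma * P @ V

def policy_iteration(P, r, gamma, max_iter=1000):
    pi = np.zeros(r.shape[0], dtype=int)       # arbitrary initial policy
    for _ in range(max_iter):
        Q = evaluate(P, r, pi, gamma)          # policy evaluation
        new_pi = Q.argmax(axis=1)              # policy improvement
        if np.array_equal(new_pi, pi):         # greedy policy is stable
            break
        pi = new_pi
    return pi, Q

# Hypothetical random MDP, for illustration only.
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))
r = rng.uniform(size=(S, A))

pi, Q = policy_iteration(P, r, gamma)
```

When the loop stops because the policy is stable, `Q` is the value of a policy that is greedy with respect to itself, so it satisfies $Q=\mathcal{T} Q$ and hence equals $Q^{\star}$ by the Bellman optimality equations.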

To obtain similar conclusions, we observe that

> $Q^{\pi_{k+1}} \geq \mathcal{T} Q^{\pi_{k}} \geq Q^{\pi_{k}} \implies \left\|Q^{\pi_{k+1}}-Q^{\star}\right\|_{\infty} \leq \gamma\left\|Q^{\pi_{k}}-Q^{\star}\right\|_{\infty}$,

from which it follows that

> ( **Policy iteration convergence** ) Let  $\pi_{0}$  be any initial policy. Then for $k \geq \frac{\log \frac{1}{(1-\gamma) \epsilon}}{1-\gamma}$, we have the following performance bound:
>
> $$ Q^{\pi_{k}} \geq Q^{\star}-\epsilon \mathbb{1} .$$
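
The contraction step used for policy iteration can be spelled out as a short derivation. It relies only on $Q^{\star}=\mathcal{T} Q^{\star}$, Lemma I, and the entrywise ordering $Q^{\star} \geq Q^{\pi_{k+1}} \geq \mathcal{T} Q^{\pi_{k}}$:

$$\begin{aligned}
0 \leq Q^{\star}-Q^{\pi_{k+1}} \leq Q^{\star}-\mathcal{T} Q^{\pi_{k}} & =\mathcal{T} Q^{\star}-\mathcal{T} Q^{\pi_{k}} \\
\implies \left\|Q^{\pi_{k+1}}-Q^{\star}\right\|_{\infty} & \leq\left\|\mathcal{T} Q^{\star}-\mathcal{T} Q^{\pi_{k}}\right\|_{\infty} \leq \gamma\left\|Q^{\pi_{k}}-Q^{\star}\right\|_{\infty} .
\end{aligned}$$

Iterating and using $\left\|Q^{\pi_{0}}-Q^{\star}\right\|_{\infty} \leq 1 /(1-\gamma)$ gives $\left\|Q^{\pi_{k}}-Q^{\star}\right\|_{\infty} \leq \frac{\gamma^{k}}{1-\gamma} \leq \frac{\exp (-(1-\gamma) k)}{1-\gamma}$, which is at most $\epsilon$ once $k \geq \frac{\log \frac{1}{(1-\gamma) \epsilon}}{1-\gamma}$.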

Hence, for computing an exact optimal policy, policy iteration is no worse than value iteration. However, if we want to compute an exact optimal policy with a running time independent of the bit complexity $L(P, r, \gamma)$, improvements are possible.

Another approach is to solve the following linear program with variables  $V \in \mathbb{R}^{|\mathcal{S}|}$:

$$\begin{aligned}
\min & \sum_{s} \mu(s) V(s) \\
\text { s.t. } & V(s) \geq r(s, a)+\gamma \sum_{s^{\prime}} P\left(s^{\prime} \mid s, a\right) V\left(s^{\prime}\right) \quad \forall a \in \mathcal{A}, s \in \mathcal{S}
\end{aligned}$$


Provided that  $\mu$  has full support, the optimal value function  $V^{\star}$  is the unique solution to this linear program. Since linear programs can be solved in time polynomial in their bit size, the running time of this approach depends only on the bit-length description of the MDP, i.e. $L(P, r, \gamma)$.
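
As an illustration, the linear program can be handed to an off-the-shelf solver. The sketch below assumes SciPy is available and uses `scipy.optimize.linprog`; the helper name `solve_mdp_lp`, the uniform $\mu$, and the random MDP are illustrative assumptions. Each constraint $V(s) \geq r(s, a)+\gamma \sum_{s^{\prime}} P(s^{\prime} \mid s, a) V(s^{\prime})$ is rewritten in the solver's form $A_{\mathrm{ub}} V \leq b_{\mathrm{ub}}$.

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, r, gamma, mu):
    """Minimize mu @ V subject to V(s) >= r(s,a) + gamma * P(.|s,a) @ V for all (s, a)."""
    S, A = r.shape
    # Row (s, a): (gamma * P[s, a, :] - e_s) @ V <= -r(s, a).
    A_ub = gamma * P.reshape(S * A, S) - np.repeat(np.eye(S), A, axis=0)
    b_ub = -r.reshape(S * A)
    # V is a free variable, so the default non-negativity bounds are removed.
    res = linprog(c=mu, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * S)
    return res.x

# Hypothetical random MDP with a full-support initial distribution.
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))
r = rng.uniform(size=(S, A))
mu = np.full(S, 1.0 / S)

V_lp = solve_mdp_lp(P, r, gamma, mu)
```

The result should agree with the fixed point of value iteration up to solver tolerance, which gives a simple cross-check between the two approaches.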

---

### 4. Sample Complexity

We are interested in understanding the number of samples required to find a near optimal policy, which is the sample complexity.

> $\widehat{M}$  is the empirical MDP that is identical to the original  $M$, except that it uses  $\widehat{P}$  instead of  $P$  for the transition model. 
>
> $\widehat{V}^{\pi}, \widehat{Q}^{\pi}, \widehat{Q}^{\star}$, and  $\widehat{\pi}^{\star}$  denote the value function, state-action value function, optimal state-action value, and optimal policy in  $\widehat{M}$, respectively.

A generative model takes as input a state-action pair  $(s, a)$  and returns a sample  $s^{\prime} \sim P(\cdot \mid s, a)$  and the reward  $r(s, a)$. 

Let us consider the most naive approach to learning when we have access to a generative model: suppose we call our simulator  $N$  times at each state-action pair. Let  $\widehat{P}$  be our empirical model, 

$$\widehat{P}\left(s^{\prime} \mid s, a\right)=\frac{\operatorname{count}\left(s^{\prime}, s, a\right)}{N}$$

where $\operatorname{count}\left(s^{\prime}, s, a\right)$  is the number of times the state-action pair  $(s, a)$  transitions to state  $s^{\prime}$. We can view  $\widehat{P}$  as a matrix of size  $|\mathcal{S}||\mathcal{A}| \times|\mathcal{S}|$.

Then, using McDiarmid's inequality, we obtain for each fixed $(s, a)$ the concentration bound

$$\|P(\cdot \mid s, a)-\widehat{P}(\cdot \mid s, a)\|_{1} \leq c \sqrt{\frac{|\mathcal{S}| \log (1 / \delta)}{m}}$$

with probability greater than $1-\delta$, where $m$ is the number of samples drawn at $(s, a)$ (here $m=N$) and $c$ is an absolute constant.

It follows that 

$$\begin{aligned}
\left\|Q^{\pi}-\widehat{Q}^{\pi}\right\|_{\infty} & =\left\|\gamma\left(I-\gamma \widehat{P}^{\pi}\right)^{-1}(P-\widehat{P}) V^{\pi}\right\|_{\infty} \leq \frac{\gamma}{1-\gamma}\left\|(P-\widehat{P}) V^{\pi}\right\|_{\infty} \\
& \leq \frac{\gamma}{1-\gamma}\left(\max _{s, a}\|P(\cdot \mid s, a)-\widehat{P}(\cdot \mid s, a)\|_{1}\right)\left\|V^{\pi}\right\|_{\infty} \leq \frac{\gamma}{(1-\gamma)^{2}} \max _{s, a}\|P(\cdot \mid s, a)-\widehat{P}(\cdot \mid s, a)\|_{1}
\end{aligned}$$
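
The chain of inequalities above holds for every realization of $\widehat{P}$, so it can be checked on simulated data. The following sketch assumes a generative model simulated with NumPy's multinomial sampler; the names `estimate_model` and `q_pi`, the choice $N=1000$, the fixed policy, and the random MDP are illustrative assumptions.

```python
import numpy as np

def estimate_model(P, N, rng):
    """Call a simulated generative model N times per (s, a) and return P_hat."""
    S, A, _ = P.shape
    P_hat = np.zeros_like(P)
    for s in range(S):
        for a in range(A):
            # multinomial returns count(s', s, a) for N draws s' ~ P(. | s, a)
            P_hat[s, a] = rng.multinomial(N, P[s, a]) / N
    return P_hat

def q_pi(P, r, pi, gamma):
    """Q^pi for a deterministic stationary policy pi under transition model P."""
    S = r.shape[0]
    idx = np.arange(S)
    V = np.linalg.solve(np.eye(S) - gamma * P[idx, pi], r[idx, pi])
    return r + gamma * P @ V

# Hypothetical random MDP and policy, for illustration only.
rng = np.random.default_rng(0)
S, A, gamma, N = 3, 2, 0.9, 1000
P = rng.dirichlet(np.ones(S), size=(S, A))
r = rng.uniform(size=(S, A))
pi = np.zeros(S, dtype=int)

P_hat = estimate_model(P, N, rng)
model_err = np.abs(P - P_hat).sum(axis=2).max()    # max_{s,a} ||P - P_hat||_1
value_err = np.abs(q_pi(P, r, pi, gamma) - q_pi(P_hat, r, pi, gamma)).max()
bound = gamma / (1 - gamma) ** 2 * model_err
print(f"model error {model_err:.4f}, value error {value_err:.4f}, bound {bound:.4f}")
```

In runs like this the value error is typically far below the bound, which reflects that the $\ell_{1}$-based argument is a worst-case analysis.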

Note that $\max _{s, a}\|P(\cdot \mid s, a)-\widehat{P}(\cdot \mid s, a)\|_{1}$ measures model accuracy, $\left\|Q^{\pi}-\widehat{Q}^{\pi}\right\|_{\infty}$ measures uniform value accuracy, and $\|\widehat{Q}^{\star}-Q^{\star}\|_{\infty}$ measures near optimal planning. Then we know that 

> **Thm.** If $\# \{ \text{samples from generative model} \} =|\mathcal{S}||\mathcal{A}|N$ is large enough ( w.r.t. $|\mathcal{S}|, |\mathcal{A}|, \delta, \gamma$ and the target accuracy ), then we have model accuracy, uniform value accuracy, and near optimal planning.

Here the observation for near optimal planning follows from

$$\left|\widehat{Q}^{\star}(s, a)-Q^{\star}(s, a)\right|=\left|\sup _{\pi} \widehat{Q}^{\pi}(s, a)-\sup _{\pi} Q^{\pi}(s, a)\right| \leq \sup _{\pi}\left|\widehat{Q}^{\pi}(s, a)-Q^{\pi}(s, a)\right|.$$

---

### Reference

1. Alekh Agarwal. Reinforcement Learning: Theory and Algorithms.
2. Warren B. Powell. Reinforcement Learning and Stochastic Optimization.