# 2 Delayed Markov Decision Processes

## 2.1 Naive Markovian policies

To see what goes wrong when writing fixed-point equations for $V^\pi(s)$ in the given delayed MDP with naive Markovian policies, we first recall the standard Bellman equation of an MDP without delays:


$$ V^\pi(s) = r(s, \pi(s)) + \lambda \sum_{s'} p(s' \mid s, \pi(s)) V^\pi(s')$$

In this equation, the value at state $s$ depends directly on the immediate reward received at $s$ and the expected future value $V^\pi(s')$ of the next state $s'$, given action $\pi(s)$.
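
As a minimal sketch of how this fixed-point equation can be solved numerically, the snippet below repeatedly applies its right-hand side until the values stop changing. The dictionary layout (`policy`, `reward`, `transition`) and the two-state example are illustrative assumptions and not part of the exercise.

```python
def evaluate_policy(states, policy, reward, transition, discount,
                    tol=1e-10, max_iter=10_000):
    """Iterate V <- r + discount * P V until the largest update is below tol."""
    values = {s: 0.0 for s in states}
    for _ in range(max_iter):
        new_values = {}
        for s in states:
            a = policy[s]
            # Expected value of the next state under p(. | s, pi(s)).
            expected_next = sum(
                prob * values[s_next]
                for s_next, prob in transition[(s, a)].items()
            )
            new_values[s] = reward[(s, a)] + discount * expected_next
        gap = max(abs(new_values[s] - values[s]) for s in states)
        values = new_values
        if gap < tol:
            break
    return values


# Hypothetical two-state MDP with a single action "stay".
states = [0, 1]
policy = {0: "stay", 1: "stay"}
reward = {(0, "stay"): 1.0, (1, "stay"): 0.0}
transition = {(0, "stay"): {0: 1.0}, (1, "stay"): {0: 0.5, 1: 0.5}}

values = evaluate_policy(states, policy, reward, transition, discount=0.9)
```

In this toy example state 0 is absorbing with reward 1, so its value approaches $1 / (1 - 0.9) = 10$, and the value of state 1 follows from the same equation.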

In the delayed MDP scenario at hand, the agent’s action at time $t$ depends on the state from $d$ steps ago, $s_{t-d}$. Specifically, for $t \geq d+1$, the action is $a_t = \pi(s_{t-d})$. This leads to the following problems when trying to write a fixed-point equation for $V^\pi(s)$: 

1. The standard Bellman equation requires the future to be independent of the past given the present state and action (the Markov property, as defined on the board in the lecture and, more formally, on slide 11 of the first slide deck under Markovian dynamics). In the delayed case, the action at time $t$ depends on the past state $s_{t-d}$ rather than on the current state $s_t$, which makes the process non-Markovian with respect to $s_t$.

2. Due to the delayed dependence, the immediate reward $r(s_t, a_t)$ and the transition probabilities $p(s_{t+1} \mid s_t, a_t)$ depend on $s_{t-d}$ through $a_t$. As a result, $V^\pi(s)$ cannot be expressed solely in terms of $V^\pi(s')$ for states $s'$ reachable from $s$ in one step. The recursive structure required for a fixed-point equation, as introduced on slide 24 part 1, is broken.
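
The two points above can be illustrated with a small sketch: for the same current state $s_t$, the distribution of $s_{t+1}$ changes with the delayed state $s_{t-d}$, because the action is chosen from $s_{t-d}$. The two-state environment, the action names, and the helper `next_state_distribution` are hypothetical and only serve the illustration.

```python
def next_state_distribution(p, policy, current_state, delayed_state):
    """Distribution of s_{t+1} when a_t = policy[s_{t-d}] is applied in s_t."""
    action = policy[delayed_state]  # the action is chosen from the delayed state
    return p[(current_state, action)]


# Hypothetical deterministic dynamics: "stay" keeps the state, "flip" toggles it.
p = {
    (0, "stay"): {0: 1.0},
    (0, "flip"): {1: 1.0},
    (1, "stay"): {1: 1.0},
    (1, "flip"): {0: 1.0},
}
policy = {0: "stay", 1: "flip"}

# Same current state s_t = 0, two different delayed states s_{t-d}.
dist_a = next_state_distribution(p, policy, current_state=0, delayed_state=0)
dist_b = next_state_distribution(p, policy, current_state=0, delayed_state=1)
```

Here `dist_a` and `dist_b` differ although the current state is the same, so $s_t$ alone does not determine the distribution of $s_{t+1}$.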






## 2.2 Equivalent augmented state MDP

Augmenting the states as $\bar{s}_t = (s_{t-d}, a_{t-d}, \ldots, a_{t-1})$ reduces the problem to an MDP with no delay. With this augmentation the process becomes Markovian with respect to the augmented state $\bar{s}_t$, which resolves the problems from 2.1. The additional components "absorb" the delay and make it possible for the agent to operate in an MDP without delays, where standard reinforcement learning techniques can be applied.

Transition Probabilities:

The transition probability from state $\bar{s}_t$ to $\bar{s}_{t+1} = (s_{t+1-d}, a'_{t-d+1}, \ldots, a'_t)$ given action $a_t$, where primes mark the actions stored in the next augmented state, is defined as:

$$ P(\bar{s}_{t+1} \mid \bar{s}_t, a_t) = p(s_{t+1-d} \mid s_{t-d}, a_{t-d}) \times \prod_{k=1}^{d-1} \mathbf{1}\{ a'_{t-d+k} = a_{t-d+k} \} \times \mathbf{1}\{ a'_t = a_t \}$$

The transition probabilities now depend only on the current augmented state $\bar{s}_t$ and the chosen action $a_t$. This satisfies the Markov property because the future state $\bar{s}_{t+1}$ depends only on $\bar{s}_t$ and $a_t$, not on any prior states or actions outside $\bar{s}_t$.
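
As a rough sketch of this construction, the function below builds the augmented transition kernel from a base kernel. It assumes a tabular setting where `p[(s, a)]` is a dictionary mapping next states to probabilities and $d \geq 1$; the function name and data layout are illustrative choices. The indicator factors are implemented implicitly: only next augmented states whose stored actions are the shifted queue receive non-zero probability.

```python
from itertools import product


def augmented_transitions(states, actions, p, d):
    """Return P[(s_bar, a)] = {s_bar_next: prob} for s_bar = (s, a_1, ..., a_d)."""
    P = {}
    for s in states:
        for queue in product(actions, repeat=d):
            s_bar = (s,) + queue
            for a in actions:
                # queue[0] is the oldest pending action a_{t-d}; it drives the
                # base transition, while the new action a joins the end.
                next_queue = queue[1:] + (a,)
                P[(s_bar, a)] = {
                    (s_next,) + next_queue: prob
                    for s_next, prob in p[(s, queue[0])].items()
                }
    return P


# Hypothetical two-state, two-action base MDP with delay d = 2.
states = [0, 1]
actions = ["L", "R"]
p = {
    (0, "L"): {0: 1.0},
    (0, "R"): {0: 0.2, 1: 0.8},
    (1, "L"): {0: 0.8, 1: 0.2},
    (1, "R"): {1: 1.0},
}
P = augmented_transitions(states, actions, p, d=2)
row = P[((0, "R", "L"), "R")]
```

In the example, `row` gives probability 0.2 to `(0, "L", "R")` and 0.8 to `(1, "L", "R")`: the base state moves according to the oldest action `"R"`, and the action queue shifts by one position.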

This enables us to write a Bellman equation for the value function $V^\pi(\bar{s})$:

$$V^\pi(\bar{s}) = r(s_{t-d}, a_{t-d}) + \lambda \sum_{\bar{s}'} P(\bar{s}' \mid \bar{s}, \pi(\bar{s})) V^\pi(\bar{s}')$$

Here $P(\bar{s}_{t+1} \mid \bar{s}_t, a_t)$ is the transition probability from $\bar{s}_t$ to $\bar{s}_{t+1}$ given action $a_t$, and $s_{t-d}$, $a_{t-d}$ are the first two components of $\bar{s}$.

At each time step $t$, the agent observes $\bar{s}_t$ and decides on $a_t$. The next augmented state $\bar{s}_{t+1}$ will be:

$\bar{s}_{t+1} = (s_{t+1-d}, a_{t+1-d}, a_{t+2-d}, \ldots, a_t)$




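Written out factor by factor, with $\delta(x, y) = 1$ if $x = y$ and $0$ otherwise, and with primes marking the actions stored in $\bar{s}_{t+1}$, the transition probability from augmented state $\bar{s}_t$ to $\bar{s}_{t+1}$ given action $a_t$ is: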

$$P(\bar{s}_{t+1} \mid \bar{s}_t, a_t) = p(s_{t+1-d} \mid s_{t-d}, a_{t-d}) \times \delta(a_{t-d+1}', a_{t-d+1}) \times \delta(a_{t-d+2}', a_{t-d+2}) \times \ldots \times \delta(a_{t-1}', a_{t-1}) \times \delta(a_t', a_t)$$
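
As a concrete instance, take $d = 2$ (an illustrative choice). Then $\bar{s}_t = (s_{t-2}, a_{t-2}, a_{t-1})$ and $\bar{s}_{t+1} = (s_{t-1}, a_{t-1}', a_t')$, where primed actions are those stored in $\bar{s}_{t+1}$, and the expression reduces to:

$$P(\bar{s}_{t+1} \mid \bar{s}_t, a_t) = p(s_{t-1} \mid s_{t-2}, a_{t-2}) \times \delta(a_{t-1}', a_{t-1}) \times \delta(a_t', a_t)$$

The only random component is the base state, which moves according to the oldest stored action $a_{t-2}$; the action entries are shifted deterministically.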

Reward function: 
The reward function in the augmented MDP is:

$r(\bar{s}_t, a_t) = r(s_{t-d}, a_{t-d})$

The immediate reward at time $t$ depends on the delayed state $s_{t-d}$ and the delayed action $a_{t-d}$, both of which are components of $\bar{s}_t$.
The reward does not depend on the current action $a_t$ because the environment provides feedback based on actions taken $d$ steps ago.
Therefore the reward at time $t$ corresponds to the action taken at time $t-d$.
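
Putting the augmented transitions and this reward together, policy evaluation in the augmented MDP can be sketched as follows. This is a minimal illustration under assumptions: a tabular base MDP given as dictionaries `p[(s, a)]` and `r[(s, a)]`, a policy passed as a function of the augmented state, and $d \geq 1$. The names and the toy example are hypothetical.

```python
from itertools import product


def evaluate_augmented(states, actions, p, r, d, policy, discount,
                       tol=1e-10, max_iter=10_000):
    """Iteratively evaluate a policy on augmented states (s, a_1, ..., a_d)."""
    aug_states = [(s,) + q for s in states for q in product(actions, repeat=d)]
    values = dict.fromkeys(aug_states, 0.0)
    for _ in range(max_iter):
        new_values = {}
        for s_bar in aug_states:
            s, oldest, rest = s_bar[0], s_bar[1], s_bar[2:]
            # The chosen action joins the end of the queue of pending actions.
            next_queue = rest + (policy(s_bar),)
            expected_next = sum(
                prob * values[(s_next,) + next_queue]
                for s_next, prob in p[(s, oldest)].items()
            )
            # The reward uses the delayed pair (s_{t-d}, a_{t-d}).
            new_values[s_bar] = r[(s, oldest)] + discount * expected_next
        gap = max(abs(new_values[k] - values[k]) for k in aug_states)
        values = new_values
        if gap < tol:
            break
    return values


# Hypothetical example with d = 1: "stay" keeps the state, "flip" toggles it,
# and only state 0 is rewarded.
states = [0, 1]
actions = ["stay", "flip"]
p = {(0, "stay"): {0: 1.0}, (0, "flip"): {1: 1.0},
     (1, "stay"): {1: 1.0}, (1, "flip"): {0: 1.0}}
r = {(0, "stay"): 1.0, (0, "flip"): 1.0, (1, "stay"): 0.0, (1, "flip"): 0.0}

values = evaluate_augmented(states, actions, p, r, d=1,
                            policy=lambda s_bar: "stay", discount=0.9)
```

In this example the augmented states `(0, "stay")` and `(0, "flip")` share the same base state but receive different values, which shows why the pending actions have to be part of the state.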
