<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" align="left" src="https://i.creativecommons.org/l/by-nc-sa/4.0/80x15.png" /></a>&nbsp;| [Emmanuel Rachelson](https://personnel.isae-supaero.fr/emmanuel-rachelson?lang=en) | <a href="https://supaerodatascience.github.io/reinforcement-learning/">https://supaerodatascience.github.io/reinforcement-learning/</a>

<div style="font-size:22pt; line-height:25pt; font-weight:bold; text-align:center;">Class 1: Markov Decision Processes,<br> a common modeling framework for sequential decision making</div>

<div class="alert alert-success">

**Learning outcomes:**  
This section is mostly about definitions. By the end of this session, you should be able to define:
- a Markov Decision Process,
- an action policy,
- a value function.
</div>

## Modeling sequential decision problems with Markov Decision Processes (30 minutes)

### Definition

Let's take a higher view and develop a general theory for describing problems such as writing a prescription for our patient.

Let us assume we have:
- a set of states $S$ describing the system to control,
- a set of actions $A$ we can apply.

Curing patients is a conceptually difficult task. 
To keep things grounded, we shall use a toy example called [FrozenLake](https://gym.openai.com/envs/FrozenLake-v0/) and work our way to more general concepts. It is also an opportunity to get familiar with [OpenAI Gym](https://gym.openai.com/).

<img src="img/frisbee.jpg" style="height: 300px;"></img>

In [None]:
import gym
import gym.envs.toy_text.frozen_lake as fl

env = gym.make('FrozenLake-v1', render_mode="ansi") # use render_mode="human" to open the game window
env.reset()
print(env.render())

In [None]:
# useful only if you have used render_mode="human" in the cell above
env.close()

The game's goal is to navigate across this lake, from position S to position G, in order to retrieve a frisbee, while avoiding falling into the holes H. Frozen positions are slippery, so you don't always move in the intended direction. Reaching the goal yields a reward of 1; every other transition yields a reward of 0. Falling into a hole or reaching the goal ends an episode.

Take a look at the funny description in `help(fl.FrozenLakeEnv)` if you are curious.

<div class="alert alert-warning">
    
**Poll:**  
[https://linkto.run/p/65E9EO4Q](https://linkto.run/p/65E9EO4Q)  
How many states are there in this game?  
How many actions?
</div>

<details class="alert alert-danger">
    <summary markdown="span"><b>Ready to see the answer? (click to expand)</b></summary>

States set: the 16 positions on the map.  
Actions set: the 4 actions $\{$N,S,E,W$\}$.
</details>

Let's confirm that:

In [None]:
print(env.observation_space)
print(env.action_space)

At every time step, the system state is $S_t$ and we decide to apply action $A_t$. This results in observing a new state $S_{t+1}$ and receiving a scalar reward signal $R_t$ for this transition.

$R_t$ tells us how happy we are with the last transition.

For example, in FrozenLake, all transitions have reward 0 except for the one that reaches the goal, which yields reward 1. Let's verify this and introduce a few utility functions on the way.

Note that $S_t$, $A_t$, $S_{t+1}$ and $R_t$ are random variables.

In [None]:
# Arrow symbols for pretty-printing the four actions
actions = {fl.LEFT: '\u2190', fl.DOWN: '\u2193', fl.RIGHT: '\u2192', fl.UP: '\u2191'}

def to_s(row, col):
    """Convert a (row, col) position to a state index."""
    return row * env.unwrapped.ncol + col

def to_row_col(s):
    """Convert a state index to a (row, col) position."""
    return divmod(s, env.unwrapped.ncol)

print(actions)
row = 3
col = 2
a = fl.RIGHT
print("Apply ", actions[a], " from (", row, ", ", col, "):", sep='')
# Each entry of P[s][a] starts with (probability, next state, reward)
for tr in env.unwrapped.P[to_s(row, col)][a]:
    print("  Reach ", to_row_col(tr[1]), " and get reward ", tr[2], " with proba ", tr[0], ".", sep='')

We will now make our main assumption about the systems we want to control.

<div class="alert alert-success">
    
**Fundamental assumption (Markov property)**
$$\mathbb{P}(S_{t+1},R_t|S_t, A_t, S_{t-1}, A_{t-1}, \ldots, S_0, A_0) = \mathbb{P}(S_{t+1},R_t|S_t, A_t)$$
</div>
    
Such a system will be called a Markov Decision Process (MDP).

One generally separates the state dynamics and the rewards by:
$$\mathbb{P}(S_{t+1},R_t|S_t, A_t) = \mathbb{P}(S_{t+1}|S_t, A_t)\cdot \mathbb{P}(R_t|S_t, A_t, S_{t+1})$$

Which leads in turn to the general definition of an MDP:
<div class="alert alert-success"><b>Markov Decision Process (MDP)</b><br>
A Markov Decision Process is given by:
<ul>
<li> A set of states $S$
<li> A set of actions $A$
<li> A (Markovian) transition model $\mathbb{P}\left(S_{t+1} | S_t, A_t \right)$, denoted $p(s'|s,a)$
<li> A reward model $\mathbb{P}\left( R_t | S_t, A_t, S_{t+1} \right)$, denoted $r(s,a)$ or $r(s,a,s')$
<li> A set of discrete decision epochs $T=\{0,1,\ldots,H\}$
</ul>
</div>
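
To make the definition concrete, here is a minimal sketch of a finite MDP stored as a plain Python dictionary, in a shape similar to the `env.unwrapped.P` table used in the cell above. The two-state MDP itself and the helper names `check_transition_model` and `sample_transition` are hypothetical and only serve as an illustration.

```python
import random

# Hypothetical two-state, two-action MDP (this is not FrozenLake).
# P[s][a] is a list of (probability, next_state, reward) triples,
# i.e. p(s'|s,a) and r(s,a,s') stored side by side.
P = {
    0: {0: [(1.0, 0, 0.0)],
        1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)],
        1: [(0.5, 1, 1.0), (0.5, 0, 0.0)]},
}

def check_transition_model(P, tol=1e-9):
    """Return True if every p(.|s,a) is a probability distribution."""
    for s in P:
        for a in P[s]:
            probs = [p for p, _, _ in P[s][a]]
            if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > tol:
                return False
    return True

def sample_transition(P, s, a, rng=random):
    """Draw (s', r) from the transition and reward models at (s, a)."""
    outcomes = P[s][a]
    weights = [p for p, _, _ in outcomes]
    _, s_next, r = rng.choices(outcomes, weights=weights, k=1)[0]
    return s_next, r

print(check_transition_model(P))
print(sample_transition(P, 0, 1, random.Random(0)))
```

The sampler only looks at the current pair `(s, a)`, never at earlier states or actions: this is the Markov property written as code.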

Most of the results presented here can be found in M. L. Puterman's classic book, [Markov Decision Processes: Discrete Stochastic Dynamic Programming](https://www.wiley.com/en-us/Markov+Decision+Processes%3A+Discrete+Stochastic+Dynamic+Programming-p-9781118625873).

If $H\rightarrow\infty$ we have an infinite horizon control problem.

<div class="alert alert-success">

Since we will only work with infinite horizon problems, we shall identify the MDP with the 4-tuple $\langle S,A,p,r\rangle$.
</div>
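
The two reward notations $r(s,a)$ and $r(s,a,s')$ are linked by an expectation over the next state. As a short worked equation, assuming a finite state set and reading $r(s,a,s')$ as the expected reward of the transition $(s,a,s')$:

$$r(s,a) = \mathbb{E}\left[ R_t \,\middle|\, S_t=s, A_t=a \right] = \sum_{s'\in S} p(s'|s,a)\, r(s,a,s')$$

For instance, if a state-action pair reaches the goal (reward 1) with probability $1/3$ and other cells (reward 0) with probability $2/3$, then $r(s,a) = 1/3$.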
    
So, in RL, we wish to control the trajectory of a system that, we suppose, behaves as a Markov Decision Process.

<img src="img/dynamic.png" style="height: 240px;"></img>

### Value of a trajectory / of a policy

An oracle decides on how to choose actions at each time step:
$$A_t \sim \pi_t.$$

$\pi_t$ is called the **decision rule** at step $t$; it is a distribution over the action space $A$.  
The collection $\pi = \left(\pi_t \right)_{t\in\mathbb{N}}$ is the oracle's **policy**.
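
As a minimal sketch of what "$A_t \sim \pi_t$" means in code, a decision rule can be written as a function returning a distribution over actions, from which an action is then sampled. The action labels, the `uniform_rule` decision rule and the `sample_action` helper are hypothetical names chosen for this illustration; what the rule is allowed to depend on is deliberately left generic here.

```python
import random

# Hypothetical action labels, for illustration only.
ACTIONS = ["left", "down", "right", "up"]

def uniform_rule(information):
    """A decision rule pi_t: returns a distribution over actions.
    This one ignores its input and picks every action with equal probability."""
    return {a: 1.0 / len(ACTIONS) for a in ACTIONS}

def sample_action(rule, information, rng=random):
    """Draw A_t ~ pi_t."""
    dist = rule(information)
    actions = list(dist)
    weights = [dist[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

# A policy is a collection of decision rules, one per time step.
# Here the same rule is reused for the first five steps.
policy = [uniform_rule for t in range(5)]
print([sample_action(policy[t], None, random.Random(t)) for t in range(5)])
```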

<img src="img/frisbee.jpg" style="height: 100px;"></img>

One policy implies one specific distribution over trajectories over the frozen lake. More generally, the policy and $S_0$ condition the sequence $S_0, A_0, R_0, S_1, A_1, R_1, \ldots$

In FrozenLake as in the patient's example, some trajectories are better than others. We need a criterion to compare trajectories. Intuitively, this criterion should reflect the idea that a good policy accumulates as much reward as possible along a trajectory.

Let's compare the policy that always moves to the right and the policy that always moves left by summing the rewards obtained along trajectories and then averaging these rewards across trajectories.

In [None]:
import numpy as np

nb_episodes = 50000
horizon = 200

def sum_of_rewards(action):
    """Play one episode, always applying `action`, and return the sum of rewards."""
    env.reset()
    total = 0.
    for t in range(horizon):
        next_state, r, terminated, truncated, _ = env.step(action)
        total += r
        if terminated:  # reached the goal or fell into a hole
            break
    return total

Vright = np.array([sum_of_rewards(fl.RIGHT) for _ in range(nb_episodes)])
Vleft  = np.array([sum_of_rewards(fl.LEFT) for _ in range(nb_episodes)])

# np.std returns the standard deviation (not the variance) across episodes
print("est. value of 'right' policy:", np.mean(Vright), "std:", np.std(Vright))
print("est. value of 'left'  policy:", np.mean(Vleft),  "std:", np.std(Vleft))

In the general case, this sum of rewards on an infinite horizon might be unbounded. So let us introduce the **$\gamma$-discounted sum of rewards** (from a starting state $s$, under policy $\pi$) random variable:
$$G^\pi(s) = \sum\limits_{t = 0}^\infty \gamma^t R_t \quad \Bigg| \quad \begin{array}{l}S_0 = s,\\ A_t \sim \pi_t,\\ S_{t+1}\sim p(\cdot|S_t,A_t),\\R_t = r(S_t,A_t,S_{t+1}).\end{array}$$

$G^\pi(s)$ represents what we can gain in the long-term by applying the actions from $\pi$.
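
A geometric series shows why discounting keeps this sum finite. As a short worked bound, assuming that rewards are bounded in absolute value by some constant $R_{\max}$ (an assumption not stated above) and that $\gamma \in [0,1)$:

$$\left| G^\pi(s) \right| \leq \sum\limits_{t = 0}^\infty \gamma^t \left| R_t \right| \leq R_{\max} \sum\limits_{t = 0}^\infty \gamma^t = \frac{R_{\max}}{1-\gamma}$$

In FrozenLake, rewards lie in $\{0,1\}$, so with $\gamma=0.9$ this (loose) bound gives $G^\pi(s) \leq 10$.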

Then, given a starting state $s$, we can define the value of $s$ under policy $\pi$:
$$V^\pi(s) = \mathbb{E} \left[ G^\pi(s) \right]$$

This defines the value function $V^\pi$ of policy $\pi$:
<div class="alert alert-success"><b>Value function $V^\pi$ of a policy $\pi$ under a $\gamma$-discounted criterion</b><br>
$$V^\pi : \left\{\begin{array}{ccl}
S & \rightarrow & \mathbb{R}\\
s & \mapsto & V^\pi(s)=\mathbb{E}\left( \sum\limits_{t = 0}^\infty \gamma^t R_t \bigg| S_0 = s, \pi \right)\end{array}\right. $$
</div>
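
The quantity inside the expectation is easy to compute for one recorded trajectory. Here is a minimal sketch, assuming the rewards of a finite episode are available as a list; the function name `discounted_return` and the example reward sequence are made up for this illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t] over a finite list of rewards."""
    g = 0.0
    # Accumulate from the last reward backwards: g <- r + gamma * g
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A FrozenLake-like episode: reward 1 on the sixth transition, 0 before.
rewards = [0, 0, 0, 0, 0, 1]
print(discounted_return(rewards, 0.9))  # equals 0.9**5 up to rounding
print(discounted_return(rewards, 1.0))  # plain sum of rewards
```

Averaging such returns over many episodes that all start in $s$ and follow $\pi$ gives a Monte-Carlo estimate of $V^\pi(s)$.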


And, given a distribution $\rho_0$ on starting states, we can map $\pi$ to the scalar value:
$$J(\pi) = \mathbb{E}_{s \sim \rho_0} \left[ V^\pi(s) \right]$$
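
For a finite state set, this expectation is a weighted average of the state values. A minimal sketch, with hypothetical values for $V^\pi$ and $\rho_0$ on three made-up states:

```python
# Hypothetical values V^pi(s) and start distribution rho_0 on three states.
V = {"s0": 0.2, "s1": 0.5, "s2": 1.0}
rho0 = {"s0": 0.5, "s1": 0.5, "s2": 0.0}

def policy_score(V, rho0):
    """J(pi) = sum over s of rho_0(s) * V^pi(s)."""
    return sum(rho0[s] * V[s] for s in rho0)

print(policy_score(V, rho0))
```

In FrozenLake every episode starts in the same position S, so $\rho_0$ puts all its mass on that state and $J(\pi)$ is simply $V^\pi(s_0)$.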

Note that this definition is quite arbitrary: instead of the expected (discounted) sum of rewards, we could have taken the average reward over all time steps, or some other (more or less exotic) comparison criterion between policies.

Most of the RL literature uses this discounted criterion (in some cases with $\gamma=1$), some uses the average reward criterion, and few works venture into more exotic criteria. For now, we will limit ourselves to the discounted criterion.

### Optimal policies

The fog clears up a bit: we can now compare policies given an initial state (or initial state distribution).  

Thus, an **optimal** policy is one that is better than any other.

<div class="alert alert-success"><b>Optimal policy $\pi^*$</b><br>
$\pi^*$ is said to be optimal iff $\pi^* \in \arg\max\limits_{\pi} V^\pi$.<br>
<br>
    
A policy is optimal if it **dominates** every other policy in every state:
$$\pi^* \textrm{ is optimal}\Leftrightarrow \forall s\in S, \ \forall \pi, \ V^{\pi^*}(s) \geq V^\pi(s)$$
</div>

Note that although there may be several optimal policies, they all share the same value function $V^* = V^{\pi^*}$.

We now get to our first fundamental result. Fortunately for us...  

<div class="alert alert-success"><b>Optimal policy theorem</b><br>
For $\left\{\begin{array}{l}
\gamma\textrm{-discounted criterion}\\
\textrm{infinite horizon}
\end{array}\right.$, 
there always exists at least one optimal stationary, deterministic, Markovian policy.
</div>

Let's explain a little:
- Markovian: all decision rules are only conditioned by the last seen state. Mathematically: 
$\left\{\begin{array}{l}
\forall \left(s_i,a_i\right)_{i\leq t-1}\in \left(S\times A\right)^{t-1}\\
\forall \left(s'_i,a'_i\right)_{i\leq t-1}\in \left(S\times A\right)^{t-1}\\
\forall s \in S
\end{array}\right., \pi_t\left(A_t|S_0=s_0, A_0=a_0, \ldots, S_t=s\right) = \pi_t\left(A_t|S'_0=s'_0, A'_0=a'_0, \ldots, S_t=s\right)$.  
One writes $\pi_t(A_t|S_t=s)$, or more simply $\pi_t(\cdot | s)$.
- Stationary (and Markovian): all decision rules are the same throughout time. Mathematically:  
$\forall (t,t')\in \mathbb{N}^2, \pi_t(A_t|S_t=s) = \pi_{t'}(A_{t'}|S_{t'}=s)$.  
This unique distribution is written $\pi(\cdot | s) = \pi_t( \cdot | s)$.
- Deterministic: all decision rules put all probability mass on a single item of the action space $A$.  
$\pi_t(A_t|\textrm{history}) = \left\{\begin{array}{l}
1\textrm{ for a single }a\\
0\textrm{ otherwise}
\end{array}\right.$.

So in simpler words, we know that among all possible optimal ways of picking $A_t$, at least one is a function $\pi:S\rightarrow A$.

That helps a lot: we don't have to search for optimal policies in a complex family of history-dependent, stochastic, non-stationary policies; instead we can simply search for a function $\pi(s)=a$ that maps states to actions.
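
On a very small MDP, this search can even be done by brute force. The sketch below is only an illustration under stated assumptions: the two-state MDP is hypothetical, the value of a policy is approximated by truncating the discounted sum after `HORIZON` steps, and the names `value` and `dominates` are made up.

```python
from itertools import product

# Hypothetical two-state MDP: P[s][a] = [(probability, next_state, reward), ...]
P = {
    0: {0: [(1.0, 0, 0.0)],
        1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)],
        1: [(0.5, 1, 1.0), (0.5, 0, 0.0)]},
}
GAMMA, HORIZON = 0.9, 300  # gamma**300 is negligible

def value(P, policy, s0, gamma=GAMMA, horizon=HORIZON):
    """Approximate V^pi(s0) = sum_t gamma^t E[R_t] by propagating
    the state distribution forward for `horizon` steps."""
    dist = {s0: 1.0}
    total = 0.0
    for t in range(horizon):
        nxt = {}
        for s, ps in dist.items():
            for p, s2, r in P[s][policy[s]]:
                total += (gamma ** t) * ps * p * r
                nxt[s2] = nxt.get(s2, 0.0) + ps * p
        dist = nxt
    return total

def dominates(v, w, tol=1e-9):
    """True if value function v is at least w in every state."""
    return all(v[s] >= w[s] - tol for s in v)

states = sorted(P)
# All deterministic stationary policies, as dicts state -> action.
policies = [dict(zip(states, acts)) for acts in product([0, 1], repeat=len(states))]
values = [{s: value(P, pi, s) for s in states} for pi in policies]
best = [pi for pi, v in zip(policies, values) if all(dominates(v, w) for w in values)]
print(best)

# Number of functions pi: S -> A in FrozenLake (16 states, 4 actions).
n_frozenlake_policies = 4 ** 16
print(n_frozenlake_policies)
```

The last line shows why enumeration does not scale: even the tiny FrozenLake map already has more than four billion deterministic stationary policies.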

### Summary

Let's wrap this whole section up. Our goal was to formally define the search for the best strategy for our game of FrozenLake and the medical prescription problem. This has led us to formalizing the general **discrete-time stochastic optimal control problem**:
- Environment (discrete time, non-deterministic, non-linear, Markov) $\leftrightarrow$ MDP.
- Behaviour $\leftrightarrow$ control policy $\pi : s\mapsto a$.
- Policy evaluation criterion $\leftrightarrow$ $\gamma$-discounted criterion.
- Goal $\leftrightarrow$ Maximize value function $V^\pi(s)$.

So we have built the first stage of our three-stage rocket:  
<div class="alert alert-success">
    
**What is the system to control?**  
The system to control is a Markov Decision Process $\langle S, A, p, r \rangle$ and we will control it with a policy $\pi:s\mapsto a$ in order to optimize $\mathbb{E} \left( \sum_t \gamma^t R_t\right)$
</div>

<div class="alert alert-warning">

**Poll** The limits of MDP modeling  
[https://linkto.run/p/0WG7WNER](https://linkto.run/p/0WG7WNER)  
Can these systems be modeled as MDPs?   
- Playing a tennis video game based on a single video frame
- Playing a tennis video game based on a full physical description of the ball and the players
- The game of Poker
- The collaborative game of [Hanabi](https://en.wikipedia.org/wiki/Hanabi_(card_game))
</div>

<details class="alert alert-danger">
    <summary markdown="span"><b>Ready to see the answer? (click to expand)</b></summary>

A single video frame does not contain enough information to accurately represent the current state of the game. The velocities are absent for instance. Hence the dynamics might not be Markovian.
    
A full physical description, however, may contain enough information so that $\mathbb{P}(S_{t+1})$ is only conditioned by $S_t$ and $A_t$.
    
Poker is a multi-player, adversarial, stochastic game with hidden information. MDPs only model single-agent decision problems.

Beyond the fact that it is a multi-player game, Hanabi is based mainly on epistemic reasoning, that is, reasoning on beliefs about the state of the world (specifically, the cards in one's own hand, which only the other players can see). This type of state description is difficult to encode within a Markovian dynamics model.

**A bit of additional discussion to generalize these notions:**

What if the system is an MDP but its state is not fully observable?  
$\rightarrow$ This is the (exciting) field of Partially Observable MDPs. Our key result of having a Markovian optimal policy does not hold anymore. There are ways to still obtain optimal policies (although this is often computationally very costly) or to approximate them with Markovian policies.

What happens if there are multiple actions taken at the same time by different agents?  
$\rightarrow$ This falls into the category of multi-player stochastic games. Such games can be adversarial, cooperative, or a mix of the two. Of course they can also have partial observability.

What if the transition model is not Markovian?  
$\rightarrow$ Beware, here be dragons! All the beautiful framework above crumbles if its hypotheses are violated. So great care should be taken when choosing the state variables for a given problem. In a sense, an MDP is a discrete-time version of a first-order differential equation. Writing a system as $\dot{X} = f(X,U, \textrm{noise})$, as is common in Control Theory, is a good practice to ensure the Markov property.
</details>
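
The role of the state variables can be illustrated with a toy version of the tennis example. The sketch below uses a hypothetical one-dimensional bouncing ball (the `step` function and the table names are made up): when only the position is recorded, as in a single video frame, the same observation is followed by different successors, whereas the pair (position, velocity) always has a single successor.

```python
from collections import defaultdict

def step(pos, vel, size=5):
    """Deterministic 1D ball: bounce off the walls at 0 and size-1."""
    if not 0 <= pos + vel < size:
        vel = -vel
    return pos + vel, vel

# Roll out one trajectory and record which successors follow each description.
next_from_frame = defaultdict(set)  # key: position only ("single frame")
next_from_full = defaultdict(set)   # key: (position, velocity)
pos, vel = 0, 1
for _ in range(20):
    new_pos, new_vel = step(pos, vel)
    next_from_frame[pos].add(new_pos)
    next_from_full[(pos, vel)].add((new_pos, new_vel))
    pos, vel = new_pos, new_vel

# Position 2 is followed by position 1 or 3, depending on the hidden velocity.
print(dict(next_from_frame))
# With the full description, every state has exactly one successor.
print(all(len(v) == 1 for v in next_from_full.values()))
```

Knowing the previous frame would resolve the ambiguity, which is exactly what it means for the position-only description to violate the Markov property.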

### Homework: MDP notions

The exercises below are here to help you play with the concepts introduced above, to better grasp them. They are not optional if you want to reach the class goals. Often, the provided answer reaches out further than the plain question asked and provides comments, additional insights, or external references.

<div class="alert alert-warning">
    
**Exercise**  
In the text above, we wrote that $\pi_t$ is the distribution over the action space $A$ for the action $A_t$ taken at time step $t$.  
- Write this probability $\mathbb{P}(A_t)$ as a conditional probability $\pi_t(A_t|\ldots)$ (the real question is: what are the $\ldots$?).
- Rephrase, with your own words, what this $\pi_t(A_t|\ldots)$ indicates.  
Then we defined a policy $\pi$ as the collection of decision rules $\left( \pi_t \right)_{t\in\mathbb{N}}$.
- Using the answer to the previous questions, write the definition of a Markovian policy, then a stationary Markovian policy (the answer is actually in the text just after the Optimal policy theorem, the exercise is about being able to recall and explain the definitions and what they imply). 
</div>

<details class="alert alert-danger">
    <summary markdown="span"><b>Ready to see the answer? (click to expand)</b></summary>

- $\pi_t$ describes the distribution over actions at time step $t$. Because of causality (future events don't affect current events), it can only depend on the realization of the state and actions random variables in previous time steps:
$$\mathbb{P}(A_t) = \pi_t(A_t | S_0, A_0, \ldots, S_{t-1}, A_{t-1}, S_t)$$
We define the *history* $H_t = S_0, A_0, \ldots, S_{t-1}, A_{t-1}, S_t$ at time step $t$ as this random sequence. So:
$$\mathbb{P}(A_t) = \pi_t(A_t | H_t)$$

- In plain words, for an action $a$ and a history $h$ at step $t$, $\pi_t(a|h)$ indicates  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; the probability to pick action $a$  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; at time $t$,  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; given the history of states/actions $h$.  
This is called a *history-dependent, non-stationary, stochastic* policy and is the most generic class of policies.

- In a Markovian policy, all decision rules are only conditioned by the last encountered state.
$$\pi_t(A_t|H_t) = \pi_t(A_t | S_t)$$
In other words, given two (possibly different) sequences of realizations of the state-action random variables up to time $t-1$ and a single realization of $S_t$, the distribution of $A_t$ is the same.
Mathematically: 
$\left\{\begin{array}{l}
\forall \left(s_i,a_i\right)_{i\leq t-1}\in \left(S\times A\right)^{t-1}\\
\forall \left(s'_i,a'_i\right)_{i\leq t-1}\in \left(S\times A\right)^{t-1}\\
\forall s \in S
\end{array}\right.,$
\begin{align*}
    \pi_t(A_t|H_t) &= \pi_t\left(A_t|S_0=s_0, A_0=a_0, \ldots, S_t=s\right)\\
    &= \pi_t\left(A_t|S'_0=s'_0, A'_0=a'_0, \ldots, S_t=s\right)\\
    &= \pi_t(A_t | S_t)
\end{align*}
One writes $\pi_t(A_t|S_t=s)$, or more simply $\pi_t(\cdot | s)$.  
In a stationary Markovian policy, all decision rules are the same throughout time. Mathematically:  
$\forall (t,t')\in \mathbb{N}^2, \pi_t(A_t|S_t=s) = \pi_{t'}(A_{t'}|S_{t'}=s)$.  
This unique distribution is written $\pi(\cdot | s) = \pi_t( \cdot | s)$.
</details>

<div class="alert alert-warning">

**Exercise**  
In the patient example, suppose the physician tells the patient to take drug A every day for 5 days, then drug B every two days for 9 days, then come back for a check-up. The physician adds that the patient should take drug C once a day if she feels pain over two consecutive days. Can you write the sequence of corresponding decision rules?
</div>

<details class="alert alert-danger">
    <summary markdown="span"><b>Ready to see the answer? (click to expand)</b></summary>

This prescription is made over a finite horizon $H=14$ days. The actions are the combinations of drugs $A=\left\{ \emptyset, (A), (B), (C), (A,B), (A,C), (B,C), (A,B,C) \right\}$.   
    
The prescription is deterministic: the distribution over actions is a Dirac. We will write it $a_t = \pi_t(h)$.
    
The prescription depends on the last two states of the patient. So it's not Markovian, it is history-dependent. Precisely, it depends on the boolean state variable "is there pain?". So we can write $\pi_t(h) = \pi_t(s_t,s_{t-1})$.  
  
It is also not stationary, since the prescription changes after day 5.  
    
Consequently, the policy is:  
For $t \in [1, 5]$:   
if $pain(s_t,s_{t-1})=True$, $\pi_t(s_t,s_{t-1}) = (A,C)$,  
if $pain(s_t,s_{t-1})=False$, $\pi_t(s_t,s_{t-1}) = (A)$.  
For $t \in [6, 14]$:   
if $t$ is even and $pain(s_t,s_{t-1})=True$, $\pi_t(s_t,s_{t-1}) = (B,C)$,  
if $t$ is even and $pain(s_t,s_{t-1})=False$,  $\pi_t(s_t,s_{t-1}) = (B)$,  
if $t$ is odd and $pain(s_t,s_{t-1})=True$, $\pi_t(s_t,s_{t-1}) = (C)$,  
if $t$ is odd and $pain(s_t,s_{t-1})=False$,  $\pi_t(s_t,s_{t-1}) = \emptyset$
</details>
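
The decision rules of this answer can also be written as a single function of the day and of the last two pain observations. This is only a sketch of one reading of the prescription: the function name `prescription` and the encoding of drugs as strings are made up, and "pain" is interpreted as pain on both day $t$ and day $t-1$.

```python
def prescription(t, pain_today, pain_yesterday):
    """Decision rule pi_t(s_t, s_{t-1}) for day t in 1..14.
    Returns the tuple of drugs to take on day t."""
    if not 1 <= t <= 14:
        raise ValueError("the prescription only covers days 1 to 14")
    drugs = []
    if t <= 5:
        drugs.append("A")  # drug A every day for the first 5 days
    elif t % 2 == 0:
        drugs.append("B")  # drug B on even days from day 6 to day 14
    if pain_today and pain_yesterday:
        drugs.append("C")  # drug C after two consecutive days of pain
    return tuple(drugs)

print(prescription(3, True, True))    # day 3, pain on days 2 and 3
print(prescription(7, False, True))   # day 7, no pain today
```

The explicit dependence on `t` is what makes the policy non-stationary, and the dependence on `pain_yesterday` is what makes it non-Markovian.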

<div class="alert alert-warning">

**Exercise**  
Use the FrozenLake environment we've introduced earlier to obtain a Monte-Carlo estimate of $V^\pi(s_0)$ over 100000 trials, with $s_0$ being the initial state and $\pi$ being a simple policy that always goes right. Take $\gamma = 0.9$. Yes, the code is almost the same as the example provided earlier.  
Note that $\gamma^{200} \sim 10^{-9}$ so any reward obtained after 200 time steps will have a negligible contribution to $V^\pi(s_0)$, thus rolling an episode out for 200 time steps should be sufficient.
</div>
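
The order of magnitude quoted in the exercise can be checked in two lines. The second quantity is a geometric-series bound on everything that happens after step 200, assuming rewards are at most 1 in absolute value (which holds in FrozenLake).

```python
gamma, horizon = 0.9, 200

# Weight of a reward received at time step 200
tail_weight = gamma ** horizon
# Bound on the discounted sum of all rewards from step 200 on, if |R_t| <= 1
tail_bound = gamma ** horizon / (1 - gamma)

print(f"{tail_weight:.2e}")
print(f"{tail_bound:.2e}")
```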

In [None]:
# %load solutions/RL1_exercise1.py
### WRITE YOUR CODE HERE
# If you get stuck, uncomment the line above to load a correction in this cell (then you can execute this code).

<div class="alert alert-warning">

**Exercise and note on the stationary distribution under policy $\pi$**  

Let's consider an MDP and a certain policy $\pi$. Let's initialize the MDP to a starting state $s_0$ drawn from a distribution $\rho_0(s)$ and let's look at how the state evolves across time steps.

Because the stochastic process of $S_t$ is a Markov chain (since $\pi$ is fixed, the probability of reaching $S_{t+1}$ is only conditioned by $S_t$), in the long run, the distribution of states follows a stationary distribution $\rho^\pi(s|s_0)$.

This distribution is not necessarily unique: it depends on $s_0$. When all states are represented with non-zero probability in this distribution, the corresponding Markov chain is said to be *ergodic*. This is an assumption that will often be made to simplify future reasoning, even if it is false most of the time.

What can we say about the stationary distribution of the Markov chain corresponding to:
- the patient with a chronic disease under a policy that fights off the disease?
- the patient with a deadly disease under a policy that doesn't cure her?
- the FrozenLake example with a fixed random policy?
- the Mad Hatter's casino (from the previous class) under a fixed random policy?
</div>

<details class="alert alert-danger">
    <summary markdown="span"><b>Ready to see the answer? (click to expand)</b></summary>

The patient with a chronic disease under a policy that fights off the disease will most likely live a rather long life (let's say infinite, for the sake of this example) and will explore states that are linked to the evolution of the disease. The states corresponding to non-recoverable situations however will not be visited.
    
The patient with a deadly disease and a bad treatment policy will likely die, sadly. On an infinite horizon, the stationary distribution only has probability mass on the states corresponding to death.
    
Similarly, the FrozenLake example has several terminal states, either by reaching the goal or by falling into a hole. It should be noted however that for such episodic environments, it is possible to define an alternate distribution $\rho^\pi(s|s_0)$ that describes the distribution of states before termination.
    
Finally, the Mad Hatter's casino under a fixed random policy is a very nice ergodic Markov chain: from any starting state there is a non-zero probability of reaching any state in a finite number of steps. No terminal states in wonderland!
</details>
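
The contrast between these answers can be reproduced numerically by propagating a state distribution through the Markov chain for many steps. The two three-state transition matrices below are hypothetical (they do not come from FrozenLake or from the patient example), and `propagate` is a made-up helper name.

```python
def propagate(rho, T, steps):
    """Apply rho <- rho T `steps` times, with T[i][j] = P(next = j | current = i)."""
    n = len(T)
    for _ in range(steps):
        rho = [sum(rho[i] * T[i][j] for i in range(n)) for j in range(n)]
    return rho

# Chain with an absorbing ("terminal") state 2, as in the deadly-disease case.
T_absorbing = [[0.5, 0.4, 0.1],
               [0.3, 0.5, 0.2],
               [0.0, 0.0, 1.0]]
# Chain in which every state can be reached from every other state.
T_ergodic = [[0.5, 0.5, 0.0],
             [0.25, 0.5, 0.25],
             [0.0, 0.5, 0.5]]

rho0 = [1.0, 0.0, 0.0]  # start in state 0 with probability 1
print([round(x, 3) for x in propagate(rho0, T_absorbing, 500)])
print([round(x, 3) for x in propagate(rho0, T_ergodic, 500)])
```

In the first chain all the probability mass ends up on the absorbing state, whereas in the second chain every state keeps a non-zero long-run probability.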