# Reinforcement Learning fundamentals

## Foreword <a class="tocSkip">
    
How this course works (pedagogy):
- one notebook to rule them all (them = the concepts)
- no slides
- short exercises along the way
- a bit of live coding
- two class breaks for you to breathe
    
What you should expect:
- some plain-word notions,
- but also a fair bit of (hopefully painless) rigorous notation and concepts,
- most things fully written down, so that you can replay the notebook on your own.

Color code:
<div class="alert alert-success">Key results in green boxes</div>
<div class="alert alert-warning">Exercises in yellow boxes</div>

<div class="alert alert-warning">

**Prerequisites:**
- Basic algebra
- Random variables, probability distributions.
    
**Useful but not compulsory:**
- Random processes, Markov chains.
</div>

<h1><span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Class-goals" data-toc-modified-id="Class-goals-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Class goals</a></span></li><li><span><a href="#Ruining-the-suspense-with-a-general-definition-(5-minutes)" data-toc-modified-id="Ruining-the-suspense-with-a-general-definition-(5-minutes)-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Ruining the suspense with a general definition (5 minutes)</a></span></li><li><span><a href="#RL-within-Machine-Learning-(5-minutes)" data-toc-modified-id="RL-within-Machine-Learning-(5-minutes)-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>RL within Machine Learning (5 minutes)</a></span></li><li><span><a href="#From-plain-words-to-variables-(5-minutes)" data-toc-modified-id="From-plain-words-to-variables-(5-minutes)-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>From plain words to variables (5 minutes)</a></span></li><li><span><a href="#Modeling-sequential-decision-problems-with-Markov-Decision-Processes-(30-minutes)" data-toc-modified-id="Modeling-sequential-decision-problems-with-Markov-Decision-Processes-(30-minutes)-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Modeling sequential decision problems with Markov Decision Processes (30 minutes)</a></span><ul class="toc-item"><li><span><a href="#Definition" data-toc-modified-id="Definition-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Definition</a></span></li><li><span><a href="#Value-of-a-trajectory-/-of-a-policy" data-toc-modified-id="Value-of-a-trajectory-/-of-a-policy-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Value of a trajectory / of a policy</a></span></li><li><span><a href="#Optimal-policies" data-toc-modified-id="Optimal-policies-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Optimal policies</a></span></li><li><span><a href="#Stationary-distribution" data-toc-modified-id="Stationary-distribution-5.4"><span class="toc-item-num">5.4&nbsp;&nbsp;</span>Stationary 
distribution</a></span></li><li><span><a href="#Summary" data-toc-modified-id="Summary-5.5"><span class="toc-item-num">5.5&nbsp;&nbsp;</span>Summary</a></span></li></ul></li><li><span><a href="#Characterizing-value-functions:-the-Bellman-equations-(20-minutes)" data-toc-modified-id="Characterizing-value-functions:-the-Bellman-equations-(20-minutes)-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Characterizing value functions: the Bellman equations (20 minutes)</a></span></li><li><span><a href="#Dynamic-Programming-for-MDPs-(30-minutes)" data-toc-modified-id="Dynamic-Programming-for-MDPs-(30-minutes)-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Dynamic Programming for MDPs (30 minutes)</a></span></li><li><span><a href="#Learning-optimal-value-functions-(30-minutes)" data-toc-modified-id="Learning-optimal-value-functions-(30-minutes)-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Learning optimal value functions (30 minutes)</a></span></li><li><span><a href="#Direct-policy-optimization-(15-minutes)" data-toc-modified-id="Direct-policy-optimization-(15-minutes)-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Direct policy optimization (15 minutes)</a></span></li><li><span><a href="#Three-fundamental-challenges-in-RL-(10-minutes)" data-toc-modified-id="Three-fundamental-challenges-in-RL-(10-minutes)-10"><span class="toc-item-num">10&nbsp;&nbsp;</span>Three fundamental challenges in RL (10 minutes)</a></span></li></ul></div>

## Class goals

- acquire the fundamental building blocks of RL:
    - plain word notions
    - MDPs, policies, optimality equations, etc.
    - common notations
    - key algorithms
    - common misconceptions
- key challenges in RL and their connection to RLVS lectures

## Ruining the suspense with a general definition (5 minutes)


What is Reinforcement Learning about?

It is about learning to control dynamic systems.
<img src="img/dynamic.png" style="width: 400px;"></img>
Dynamic systems? Systems whose state $s$ and observation $o$ evolve **dynamically** under the policy $\pi$.

Our object of study:<br>
We want to find a control policy $\pi$ (with $u = \pi(o)$) such that the system $\Sigma$ behaves as we desire.

### Examples of RL problems <a class="tocSkip">


<table>
<tr>
  <td><img src="img/spiral.jpg" style="width: 200px;"></td>
  <td style="border-right:1px solid;">Exiting a spiral</td>
  <td><img src="img/tests.jpg" style="width: 200px;"></td>
  <td>Dynamic treatment regimes for HIV patients</td>
</tr>
<tr>
  <td><img src="img/pend.png" style="width: 200px;"></td>
  <td style="border-right:1px solid;">Cart-pole balancing</td>
  <td><img src="img/waiting.jpg" style="width: 200px;"></td>
  <td>Queueing problems</td>
</tr>
<tr>
  <td><img src="img/market.jpg" style="width: 200px;"></td>
  <td style="border-right:1px solid;">Portfolio management</td>
  <td><img src="img/dam.jpg" style="width: 200px;"></td>
  <td>Hydroelectric production</td>
</tr>
</table>

But also:
- Elevator scheduling
- Bicycle riding
- Ship steering
- Bioreactor control
- Aerobatics helicopter control
- Airport departures scheduling
- Airlines scheduling
- Robocup soccer
- Video game playing (Quake, CS, Starcraft...)
- Game of Go
- ...

<div class="alert alert-success">
    
Reinforcement Learning is about learning an optimal sequential behavior in a given environment.
</div>

Let's break this down.
- sequential behavior in a given environment
- optimal
- learning

<center><img src="img/dynamic.png" style="width: 400px;"></img></center>

<div class="alert alert-success">

**Keywords:**
- system to control / environment
- control policy
- optimality
</div>

<div class="alert alert-warning">
    
**Warm-up poll:** 
    
</div>

## RL within Machine Learning (5 minutes)

You may have had classes on Machine Learning before. There are three strongly distinct categories of problems in ML:
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning

Let's try to answer the following questions for each category.
- What's the abstract problem we are trying to solve?
- What's the data provided to the algorithms?
- Give examples of algorithms in SL/UL/RL.  

<center>
<table border="1">
<tr>
    <td> <b>Question</b> </td>
    <td style="border-left: 1px solid black"> <b>Supervised</b> </td>
    <td style="border-left: 1px solid black"> <b>Unsupervised</b> </td>
    <td style="border-left: 1px solid black"> <b>Reinforcement</b> </td>
</tr>
<tr>
    <td> Target </td>
    <td style="border-left: 1px solid black"> $f(x)=y$ </td>
    <td style="border-left: 1px solid black"> $x\in X$ </td>
    <td style="border-left: 1px solid black"> $\pi(s)=a$ </td>
</tr>
<tr>
    <td> Target (rephrased) </td>
    <td style="border-left: 1px solid black"> Predict outputs given inputs</td>
    <td style="border-left: 1px solid black"> Discover structure in data </td>
    <td style="border-left: 1px solid black"> Find an optimal behavior </td>
</tr>
<tr>
    <td> Data </td>
    <td style="border-left: 1px solid black"> $\left\{\left(x,y\right)\right\}$ supervisor's labels </td>
    <td style="border-left: 1px solid black"> $\left\{x\right\}$ unlabelled data </td>
    <td style="border-left: 1px solid black"> $\left\{\left(s,a,r,s'\right)\right\}$ experience samples </td>
</tr>
<tr>
    <td> Output </td>
    <td style="border-left: 1px solid black"> Classifier or regressor</td>
    <td style="border-left: 1px solid black"> Clusters or dimension reduction </td>
    <td style="border-left: 1px solid black"> Policies, value functions </td>
</tr>
<tr>
    <td> Key algorithms </td>
    <td style="border-left: 1px solid black"> Neural networks, SVMs, etc.</td>
    <td style="border-left: 1px solid black"> k-means, PCA, etc. </td>
    <td style="border-left: 1px solid black"> Q-learning, Policy Gradients, etc. </td>
</tr>
</table>
</center>

This table helps distinguish the different natures of the problems tackled. The RL problem is about finding the optimal policy for a given environment.

How is this different from Supervised Learning?
- No correct $(s,a)$ examples are provided, only $(s,a,r,s')$ experience samples.
- Delayed rewards, credit assignment, trajectories.

<div class="alert alert-warning">
    
**Poll:** How is RL different from SL?
- a
- b
- c
- d
    
</div>

## From plain words to variables (5 minutes)

### A medical prescription example <a class="tocSkip">

<img src="img/patient-doctor.png" style="height: 200px;">
    
A patient walks into a clinic with her medical file (medical history, x-rays, blood work, etc.). You, as her doctor, need to write a prescription. Let us use this example to formalize the process of deciding what to write on the prescription.

### Patient variables <a class="tocSkip">

<center>
<img src="img/patient_file.png" style="height: 100px;"> </img> <br>
Patient state now: $S_0$  <br>
Future states: $S_t$
</center>

The medical file of the patient allows us to define a number of variables that characterize the patient now. We will write $S_0$ for the vector of these variables. Future measurements will be denoted $S_t$.

$S_t$ is a random vector, taking different values in a *patient description space* $S$ at different time steps.

### Prescription <a class="tocSkip">

<center>
<img src="img/prescription.png" style="height: 100px;"> </img> <br>
Prescription: $\left( A_t \right)_{t\in\mathbb{N}} = (A_0, A_1, A_2, ...)$
</center>

The prescription is a series of recommendations we give to the patient over the course of treatment. It is thus a sequence $\left( A_t \right)_{t\in\mathbb{N}} = (A_0, A_1, A_2, ...)$ of variables $A_t$.

These treatments $A_t$ are random variables too, taking their value in some space $A$.

### Patient evolution <a class="tocSkip">


<center>
<img src="img/patient_evolution.png" style="height: 100px;"> </img> <br>
    $\mathbb{P}(S_t)$?
</center>

The patient evolves over time steps. Her evolution follows a certain probability distribution $\mathbb{P}(S_t)$ over descriptive states.

So $\left( S_t \right)_{t\in\mathbb{N}}$ defines a *random process* that describes the patient's evolution under the influence of past $S_t$ and $A_t$.

### Physician's goal <a class="tocSkip">

<img src="img/patient_happy.png" style="height: 100px;"> </img> <br>

$$J \left( \left(S_t\right)_{t\in \mathbb{N}}, \left( A_t \right)_{t\in \mathbb{N}} \right)?$$

The physician's goal is to bring the patient from an unhealthy state $S_0$ to a healthy situation.  

This goal is not only defined by a final state of the patient but by the full trajectory followed by the variables $S_t$ and $A_t$. For example, prescribing a drug that damages the patient's liver, or letting the patient experience too much pain over the course of treatment is discouraged.

We define a criterion $J \left( \left(S_t\right)_{t\in \mathbb{N}}, \left( A_t \right)_{t\in \mathbb{N}} \right)$ that allows us to quantify how good a trajectory in the joint $S\times A$ space is.

### Wrap-up <a class="tocSkip">

- Patient state $S_t$  (random variable)
- Physician instruction $A_t$ (random variable)
- Prescription $\left( A_t \right)_{t\in\mathbb{N}}$   
- Patient's evolution $\mathbb{P}(S_t)$  
- Patient's trajectory $\left( S_t \right)_{t\in\mathbb{N}}$ random process
- Value of a trajectory $J \left( \left(S_t\right)_{t\in \mathbb{N}}, \left( A_t \right)_{t\in \mathbb{N}} \right)$  

It seems reasonable that the physician's recommendation $\mathbb{P}(A_t)$ at step $t$ be dependent on previously observed states $\left(S_0, \ldots, S_t\right)$ and recommended treatments $\left(A_0, \ldots, A_{t-1}\right)$.
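
As a minimal sketch of this idea, the snippet below rolls out a trajectory in which $A_t$ is chosen from the full history $(S_0, \ldots, S_t)$ and $(A_0, \ldots, A_{t-1})$. Everything in it is a placeholder invented for illustration: the integer "health score", the drug names, the random patient dynamics and the helper names `history_policy` and `patient_step`.

```python
import random

def history_policy(states, actions):
    """Choose A_t from (S_0..S_t) and (A_0..A_{t-1}) with a made-up rule."""
    # Hypothetical rule: switch drug if the last observed state did not improve.
    if len(states) >= 2 and states[-1] <= states[-2]:
        return "drug_B"
    return "drug_A"

def patient_step(s, a, rng):
    """Placeholder dynamics: the health score drifts up or down at random."""
    return s + rng.choice([-1, 0, 1])

rng = random.Random(0)
states, actions = [5], []   # S_0 = 5 on an arbitrary health scale
for t in range(5):
    a = history_policy(states, actions)
    actions.append(a)
    states.append(patient_step(states[-1], a, rng))

print(states)   # the trajectory (S_0, ..., S_5)
print(actions)  # the prescription (A_0, ..., A_4)
```

The policy receives the whole history as input, which is the most general case; later sections show why a much simpler dependence on the current state alone can be enough.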

### Common misconception <a class="tocSkip">

You will often see the following type of drawing, along with a sentence like "RL is concerned with the problem of an agent performing actions to control an environment". 

<img src="img/misconception.png" style="height: 300px;"></img>

Although this sentence is not false *per se*, it conveys an important misconception that may be grounded in too simple anthropomorphic analogies. One often talks about the *state of the agent* or the *state of the environment*. The distinction here is confusing at best: there is no separation between agent and environment. A better vocabulary is to talk about a *system to control*, that is described through its observed *state*. This system is controlled by the application of actions issued from a *policy* or *control law*. The process of *learning* this policy is what RL is concerned with.

Although less shiny, the drawing below may be less misleading.

<img src="img/dynamic.png" style="height: 300px;"></img>

### Three key notions <a class="tocSkip">

RL is a three-stage rocket answering the questions:  
1. What is the system to control?  
2. What is an optimal strategy?  
3. How do we learn such a strategy?

<div class="alert alert-warning">
    
**Poll:**
    
</div>

## Modeling sequential decision problems with Markov Decision Processes (30 minutes)

### Definition

Let's take a higher view and develop a general theory for describing problems such as writing a prescription for our patient.

Let us assume we have:
- a set of states $S$ for the system to control,
- a set of actions $A$ we can apply.

Curing patients is a conceptually difficult task. 
To keep things grounded, we shall use a toy example called [FrozenLake](https://gym.openai.com/envs/FrozenLake-v0/) and work our way to more general concepts. It's also an opportunity to get familiar with [OpenAI Gym](https://gym.openai.com/).

In [7]:
import gym
import gym.envs.toy_text.frozen_lake as fl

env = gym.make('FrozenLake-v0')
_=env.render()


SFFF
FHFH
FFFH
HFFG


Let's take a look at this problem's description (using for example `help(fl.FrozenLakeEnv)`). We read:

`|  Winter is here. You and your friends were tossing around a frisbee at the park
|  when you made a wild throw that left the frisbee out in the middle of the lake.
|  The water is mostly frozen, but there are a few holes where the ice has melted.
|  If you step into one of those holes, you'll fall into the freezing water.
|  At this time, there's an international frisbee shortage, so it's absolutely imperative that
|  you navigate across the lake and retrieve the disc.
|  However, the ice is slippery, so you won't always move in the direction you intend.
|  The surface is described using a grid like the following
|  
|      SFFF
|      FHFH
|      FFFH
|      HFFG
|  
|  S : starting point, safe
|  F : frozen surface, safe
|  H : hole, fall to your doom
|  G : goal, where the frisbee is located
|  
|  The episode ends when you reach the goal or fall in a hole.
|  You receive a reward of 1 if you reach the goal, and zero otherwise.`

So it's a game of navigation.

<div class="alert alert-warning"><b>Questions:</b><br>What are the possible states of an agent in this game?<br> What are its possible actions?<br>How would you describe the result of action $a$ in state $s$?</div>

<div class="alert alert-danger"><a href="#answers1" data-toggle="collapse"><b>Answers:</b></a><br>
<div id="answers1" class="collapse">
States set: the 16 positions on the map.<br>
Actions set: the 4 actions $\{$N,S,E,W$\}$<br>
$s'$ resulting from $(s,a)$ follows a distribution $P(s'|s,a)$<br>
</div>
</div>

Let's confirm that:

In [8]:
print(env.observation_space)
print(env.action_space)

Discrete(16)
Discrete(4)


At every time step, the system state is $S_t$ and we decide to apply action $A_t$. This results in observing a new state $S_{t+1}$ and receiving a scalar reward signal $R_t$ for this transition.

$R_t$ tells us how happy we are with the last transition.

Note that $S_t$, $A_t$, $S_{t+1}$ and $R_t$ are random variables.

For example, in FrozenLake, all transitions have reward 0 except for the one that reaches the goal, which yields reward 1. Let's verify this and introduce a few utility functions on the way.

In [9]:
actions = {fl.LEFT: '\u2190', fl.DOWN: '\u2193', fl.RIGHT: '\u2192', fl.UP: '\u2191'}

def to_s(row, col):
    # Flatten (row, col) grid coordinates into a state index.
    return row * env.unwrapped.ncol + col

def to_row_col(s):
    # Recover (row, col) grid coordinates from a state index.
    return divmod(s, env.unwrapped.ncol)

print(actions)
row = 3
col = 2
a = fl.RIGHT
# env.unwrapped.P[s][a] lists (probability, next_state, reward, done) tuples.
print("Apply ", actions[a], " from (", row, ", ", col, "):", sep='')
for proba, next_s, reward, done in env.unwrapped.P[to_s(row, col)][a]:
    print("  Reach (", to_row_col(next_s), ") and get reward ", reward, " with proba ", proba, ".", sep='')

{0: '←', 1: '↓', 2: '→', 3: '↑'}
Apply → from (3, 2):
  Reach ((3, 2)) and get reward 0.0 with proba 0.3333333333333333.
  Reach ((3, 3)) and get reward 1.0 with proba 0.3333333333333333.
  Reach ((2, 2)) and get reward 0.0 with proba 0.3333333333333333.


We will now make our main assumption about the systems we want to control.

<div class="alert alert-success">
    
**Fundamental assumption (Markov property)**
$$\mathbb{P}(S_{t+1},R_t|S_t, A_t, S_{t-1}, A_{t-1}, \ldots, S_0, A_0) = \mathbb{P}(S_{t+1},R_t|S_t, A_t)$$
</div>
    
Such a system will be called a Markov Decision Process (MDP).

One generally separates the state dynamics and the rewards by:
$$\mathbb{P}(S_{t+1},R_t|S_t, A_t) = \mathbb{P}(S_{t+1}|S_t, A_t)\cdot \mathbb{P}(R_t|S_t, A_t, S_{t+1})$$

Which leads in turn to the general definition of an MDP:
<div class="alert alert-success"><b>Markov Decision Process (MDP)</b><br>
A Markov Decision Process is given by:
<ul>
<li> A set of states $S$
<li> A set of actions $A$
<li> A (Markovian) transition model $\mathbb{P}\left(S_{t+1} | S_t, A_t \right)$, noted $p(s'|s,a)$
<li> A reward model $\mathbb{P}\left( R_t | S_t, A_t, S_{t+1} \right)$, noted $r(s,a)$ or $r(s,a,s')$
<li> A set of discrete decision epochs $T=\{0,1,\ldots,H\}$
</ul>
</div>
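
The two notations for the reward model are related by averaging over the next state. As a short worked example (assuming, as is common, that $r(s,a)$ denotes the expected reward of applying $a$ in $s$, and $r(s,a,s')$ the expected reward of the transition to $s'$), take the FrozenLake transitions printed earlier for state $(3,2)$ and action $\rightarrow$:

$$r(s,a) = \mathbb{E}\left[R_t \,|\, S_t=s, A_t=a\right] = \sum_{s'\in S} p(s'|s,a)\, r(s,a,s') = \tfrac{1}{3}\cdot 0 + \tfrac{1}{3}\cdot 1 + \tfrac{1}{3}\cdot 0 = \tfrac{1}{3}.$$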

Most of the results presented here can be found in M. L. Puterman's classic book, [Markov Decision Processes: Discrete Stochastic Dynamic Programming](https://www.wiley.com/en-us/Markov+Decision+Processes%3A+Discrete+Stochastic+Dynamic+Programming-p-9781118625873).

If $H\rightarrow\infty$ we have an infinite horizon control problem.

<div class="alert alert-success">

Since we will only work with infinite horizon problems, we shall identify the MDP with the 4-tuple $\langle S,A,p,r\rangle$.
</div>
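
To make the 4-tuple concrete, here is a minimal sketch of a hand-made two-state MDP stored in plain Python dictionaries. The state names, action names, probabilities and rewards are invented for illustration and are unrelated to FrozenLake; `sample_step` is a hypothetical helper, not a Gym function.

```python
import random

# Hypothetical toy MDP <S, A, p, r> (all names and numbers are illustrative).
S = ["sick", "healthy"]
A = ["treat", "wait"]

# p[s][a] maps each next state s' to the probability p(s'|s,a).
p = {
    "sick":    {"treat": {"sick": 0.4, "healthy": 0.6},
                "wait":  {"sick": 0.9, "healthy": 0.1}},
    "healthy": {"treat": {"sick": 0.1, "healthy": 0.9},
                "wait":  {"sick": 0.3, "healthy": 0.7}},
}

def r(s, a, s_next):
    """Reward model r(s,a,s'): +1 for ending up healthy, small cost for treating."""
    return (1.0 if s_next == "healthy" else 0.0) - (0.1 if a == "treat" else 0.0)

def sample_step(s, a, rng):
    """Draw S_{t+1} ~ p(.|s,a) and return (s', reward)."""
    next_states = list(p[s][a])
    probs = [p[s][a][s_next] for s_next in next_states]
    s_next = rng.choices(next_states, weights=probs, k=1)[0]
    return s_next, r(s, a, s_next)

rng = random.Random(0)
print(sample_step("sick", "treat", rng))
```

Note that `sample_step` only needs the current state and action, never the past: that is the Markov property at work.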
    
So, in RL, we wish to control the trajectory of a system that, we suppose, behaves as a Markov Decision Process.

<img src="img/dynamic.png" style="height: 240px;"></img>

### Value of a trajectory / of a policy

Suppose an oracle decides on how to choose actions at each time step according to the probability distribution $\mathbb{P}(A_t)=\pi(A_t)$. This collection of probability distributions is the oracle's **policy**. Given a distribution on an initial state $S_0$, it fully conditions the trajectory $S_0, A_0, R_0, S_1, A_1, R_1, \ldots$.

In FrozenLake as in the patient's example, some trajectories are better than others. We shall introduce a criterion to compare trajectories. Intuitively, this criterion should reflect the idea that a good policy accumulates as much reward as possible along a trajectory.

Let's compare the policy that always moves to the right and the policy that always moves to the left by summing the rewards obtained along each trajectory and then averaging these sums across trajectories.

In [6]:
import numpy as np
nb_episodes = 100000
horizon = 200

Vright = np.zeros(nb_episodes)
for i in range(nb_episodes):
    env.reset()
    for t in range(horizon):
        next_state, r, done,_ = env.step(fl.RIGHT)
        Vright[i] += r
        if done:
            break

Vleft  = np.zeros(nb_episodes)
for i in range(nb_episodes):
    env.reset()
    for t in range(horizon):
        next_state, r, done,_ = env.step(fl.LEFT)
        Vleft[i] += r
        if done:
            break

print("est. value of 'right' policy:", np.mean(Vright), "std:", np.std(Vright))
print("est. value of 'left'  policy:", np.mean(Vleft),  "std:", np.std(Vleft))

est. value of 'right' policy: 0.03138 std: 0.17434246642743126
est. value of 'left'  policy: 0.0 std: 0.0


In the general case, this sum of rewards on an infinite horizon might be unbounded. So let us introduce the **$\gamma$-discounted sum of rewards** (from a starting state $s$, under policy $\pi$) random variable:
$$G^\pi(s) = \sum\limits_{t = 0}^\infty \gamma^t R_t \quad \Bigg| \quad \begin{array}{l}S_0 = s,\\ A_t \sim \pi(S_t),\\ S_{t+1}\sim p(\cdot|S_t,A_t),\\R_t = r(S_t,A_t,S_{t+1}).\end{array}$$

With $\gamma \in (0,1)$, this infinite sum of rewards represents what we can gain in the long-term by applying the actions from $\pi$, given that a reward obtained $t$ time steps in the future is discounted by $\gamma^t$. For bounded reward models, since $\gamma <1$, this sum is always finite.
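
The finiteness claim follows from the geometric series: if $|R_t| \leq r_{max}$ for all $t$, then $\left|\sum_t \gamma^t R_t\right| \leq r_{max}/(1-\gamma)$. The sketch below computes the discounted sum for a finite reward sequence and checks this bound numerically; the function name `discounted_return` and the reward sequences are illustrative choices.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * rewards[t] for a finite list of rewards."""
    g = 0.0
    # Accumulate from the last reward backwards (Horner's scheme).
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g

gamma = 0.9

# FrozenLake-like episode: zero reward until the goal is reached at step 3.
print(discounted_return([0.0, 0.0, 0.0, 1.0], gamma))

# With rewards bounded by r_max, the sum stays below r_max / (1 - gamma).
r_max = 1.0
long_run = discounted_return([r_max] * 1000, gamma)
print(long_run, "<=", r_max / (1 - gamma))
```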

Then, given a starting state $s$, we can define the value of $s$ under policy $\pi$:
$$V^\pi(s) = \mathbb{E} \left[ G^\pi(s) \right]$$

This defines the value function $V^\pi$ of policy $\pi$:
<div class="alert alert-success"><b>Value function $V^\pi$ of a policy $\pi$ under a $\gamma$-discounted criterion</b><br>
$$V^\pi : \left\{\begin{array}{ccl}
S & \rightarrow & \mathbb{R}\\
s & \mapsto & V^\pi(s)=\mathbb{E}\left( \sum\limits_{t = 0}^\infty \gamma^t R_t \bigg| S_0 = s, \pi \right)\end{array}\right. $$
</div>


And, given a distribution $\rho_0$ on starting states, we can map $\pi$ to the scalar value:
$$J(\pi) = \mathbb{E}_{s \sim \rho_0} \left[ V^\pi(s) \right]$$
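
Both $V^\pi(s)$ and $J(\pi)$ are expectations, so they can be estimated by averaging sampled discounted returns. The sketch below does this on a made-up two-state chain (not FrozenLake): from state 0, the system reaches the terminal state 1 with probability 0.5 and reward 1. The dynamics, the placeholder policy, the truncation horizon and the initial distribution `rho_0` are all assumptions made for the example.

```python
import random

GAMMA = 0.9

def step(s, a, rng):
    """Hypothetical dynamics: from state 0, reach terminal state 1 w.p. 0.5 (reward 1)."""
    if s == 0 and rng.random() < 0.5:
        return 1, 1.0, True
    return s, 0.0, s == 1

def policy(s):
    """A deterministic stationary policy (the single action 0 is a placeholder)."""
    return 0

def sample_return(s0, rng, horizon=200):
    """One sample of G^pi(s0), truncated after `horizon` steps."""
    g, discount, s = 0.0, 1.0, s0
    for _ in range(horizon):
        s, reward, done = step(s, policy(s), rng)
        g += discount * reward
        discount *= GAMMA
        if done:
            break
    return g

def estimate_value(s0, rng, nb_episodes=20000):
    """Monte Carlo estimate of V^pi(s0) = E[G^pi(s0)]."""
    return sum(sample_return(s0, rng) for _ in range(nb_episodes)) / nb_episodes

rng = random.Random(0)
v0 = estimate_value(0, rng)
v1 = estimate_value(1, rng)
rho_0 = {0: 0.5, 1: 0.5}            # assumed initial state distribution
J = rho_0[0] * v0 + rho_0[1] * v1   # J(pi) = E_{s ~ rho_0}[V^pi(s)]
print(v0, v1, J)
```

For this toy chain the exact value is $V^\pi(0) = \sum_t \gamma^t\, 0.5^{t+1} = 0.5/(1-0.5\gamma) \approx 0.909$, which the estimate should approach as the number of episodes grows.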

Note that this definition is quite arbitrary: instead of the expected (discounted) sum of rewards, we could have taken the average reward over all time steps, or some other (more or less exotic) comparison criterion between policies.

Most of the RL literature uses this discounted criterion (in some cases with $\gamma=1$), some uses the average reward criterion, and few works venture into more exotic criteria. Today, we will limit ourselves to the discounted criterion.

### Optimal policies

The fog clears up a bit: we can now compare policies given an initial state (or initial state distribution).  
We can now define what an optimal policy is.  

<div class="alert alert-success"><b>Optimal policy $\pi^*$</b><br>
$\pi^*$ is said to be optimal iff $\pi^* \in \arg\max\limits_{\pi} V^\pi$.<br>
<br>
    
A policy is optimal if it **dominates** every other policy in every state:
$$\pi^* \textrm{ is optimal}\Leftrightarrow \forall s\in S, \ \forall \pi, \ V^{\pi^*}(s) \geq V^\pi(s)$$
</div>

Note that one could also define a somewhat weaker notion of optimality, stating that a policy $\pi^*$ is optimal if:
$$\pi^* \in \arg\max_{\pi} J(\pi).$$

We now get to our first fundamental result. Fortunately for us...  

<div class="alert alert-success"><b>Optimal policy theorem</b><br>
For $\left\{\begin{array}{l}
\gamma\textrm{-discounted criterion}\\
\textrm{infinite horizon}
\end{array}\right.$, 
there always exists at least one optimal stationary, deterministic, Markovian policy.
</div>

Let's explain a little:
- Markovian : $\left\{\begin{array}{l}
\forall \left(s_i,a_i\right)\in \left(S\times A\right)^{t-1}\\
\forall \left(s'_i,a'_i\right)\in \left(S\times A\right)^{t-1}
\end{array}\right., \pi\left(A_t|S_0, A_0, \ldots, S_t\right) = \pi\left(A_t|S'_0, A'_0, \ldots, S_t\right)$.  
One writes $\pi(A_t|S_t)$.
- Stationary : $\forall (t,t')\in \mathbb{N}^2, \pi(A_t|S_t=s) = \pi(A_{t'}|S_{t'}=s)$.
- Deterministic : $\pi(A_t|history) = \left\{\begin{array}{l}
1\textrm{ for a single }a\\
0\textrm{ otherwise}
\end{array}\right.$.

So in simpler words, we know that among all possible optimal policies, at least one is a function $\pi:S\rightarrow A$.

That helps a lot: we don't have to search for optimal policies in a complex family of history-dependent, stochastic, non-stationary policies; instead we can simply search for a function $\pi(s)=a$ that maps states to actions.
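
On a small finite MDP, this search space can even be enumerated. The sketch below lists every function $\pi:S\rightarrow A$ of a made-up two-state, two-action MDP, computes $V^\pi$ directly from its definition (propagating the distribution of $S_t$ forward and summing discounted expected rewards, truncated at a finite horizon), and reports which policy dominates the others in every state. The model `P`, the horizon and the tolerance are illustrative assumptions, and the comparison only covers deterministic stationary policies, which is exactly the restriction the theorem allows.

```python
from itertools import product

GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
# Hypothetical toy model: P[s][a] = list of (probability, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)],                  # stay in 0, no reward
        1: [(0.8, 1, 0.0), (0.2, 0, 0.0)]},  # try to move to 1
    1: {0: [(1.0, 1, 1.0)],                  # stay in 1, reward 1
        1: [(1.0, 0, 0.0)]},                 # go back to 0
}

def value(pi, s0, horizon=500):
    """V^pi(s0) = sum_t gamma^t E[R_t], truncated after `horizon` steps."""
    dist = {s: 1.0 if s == s0 else 0.0 for s in STATES}  # distribution of S_t
    v, discount = 0.0, 1.0
    for _ in range(horizon):
        nxt = {s: 0.0 for s in STATES}
        for s, mass in dist.items():
            for prob, s_next, reward in P[s][pi[s]]:
                v += discount * mass * prob * reward
                nxt[s_next] += mass * prob
        dist, discount = nxt, discount * GAMMA
    return v

# Every deterministic stationary Markovian policy is a function S -> A.
policies = [dict(zip(STATES, choice))
            for choice in product(ACTIONS, repeat=len(STATES))]
values = [{s: value(pi, s) for s in STATES} for pi in policies]

# A policy is optimal if it dominates all the others in every state.
for pi, v in zip(policies, values):
    dominates = all(v[s] >= w[s] - 1e-9 for w in values for s in STATES)
    print(pi, v, "optimal" if dominates else "")
```

With $|A|^{|S|}$ candidate functions, brute-force enumeration stops being practical very quickly, which motivates the algorithms of the following sections.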

### Stationary distribution

Let's consider an MDP and a certain policy $\pi$. Let's initialize the MDP to a starting state $s_0$ drawn from a distribution $\rho_0(s)$ and let's look at how the state evolves across time steps.

Because the stochastic process of $S_t$ is a Markov chain (since $\pi$ is fixed, the probability of reaching $S_{t+1}$ depends only on $S_t$), in the long run, the distribution of states follows a stationary distribution $\rho^\pi(s|s_0)$.

This distribution is not necessarily unique: it depends on $s_0$. When all states are represented with non-zero probability in this distribution, the corresponding Markov chain is said to be *ergodic*. This is an assumption that will often be made to simplify future reasoning, even if it is false most of the time.
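
The long-run behavior of the state distribution can be observed numerically by repeatedly applying the transition matrix of the chain. The sketch below uses a made-up three-state chain (the matrix `T` stands for the dynamics induced by some fixed policy $\pi$, and its numbers are purely illustrative) in which state 2 is absorbing, so the chain is not ergodic.

```python
# Hypothetical 3-state Markov chain induced by a fixed policy pi.
# T[s][s_next] = P(S_{t+1} = s_next | S_t = s); each row sums to 1.
T = [
    [0.5, 0.5, 0.0],
    [0.2, 0.3, 0.5],
    [0.0, 0.0, 1.0],   # state 2 is absorbing (e.g. a terminal state)
]

def propagate(rho, T):
    """One step of the chain: rho_{t+1}(s') = sum_s rho_t(s) * T[s][s']."""
    n = len(T)
    return [sum(rho[s] * T[s][s_next] for s in range(n)) for s_next in range(n)]

rho = [1.0, 0.0, 0.0]   # start in state 0 with probability 1
for t in range(200):
    rho = propagate(rho, T)

print(rho)  # almost all the probability mass ends up on the absorbing state
```

Here states 0 and 1 end up with (numerically) zero probability, so this particular chain does not satisfy the ergodicity assumption.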

<div class="alert alert-warning">
    
**Exercise**  
What can we say about the stationary distribution of the Markov chain corresponding to:
- the patient with a chronic disease under a policy that fights off the disease?
- the patient with a deadly disease under a policy that doesn't cure her?
- the FrozenLake example with a fixed random policy?
</div>

<div class="alert alert-danger"><a href="#ergodic" data-toggle="collapse"><b>Answer:</b></a><br>
<div id="ergodic" class="collapse">

The patient with a chronic disease under a policy that fights off the disease will most likely live a rather long life (let's say infinite, for the sake of this example) and will explore states that are linked to the evolution of the disease. The states corresponding to non-recoverable situations however will not be visited.
    
The patient with a deadly disease and a bad treatment policy will likely die, sadly. On an infinite horizon, the stationary distribution only has probability mass on the states corresponding to death.
    
Similarly, the FrozenLake example has several terminal states, either by reaching the goal or by falling into a hole. It should be noted however that for such episodic environments, it is possible to define an alternate distribution $\rho^\pi(s|s_0)$ that describes the distribution of states before termination.
    
Finally, the Mad Hatter's casino under a fixed random policy is a very nice ergodic Markov chain: from any starting state there is a non-zero probability of reaching any state in a finite number of steps. No terminal states in wonderland!
</div>
</div>

### Summary

Let's wrap this whole section up. Our goal was to formally define the search for the best strategy for our game of FrozenLake and the medical prescription problem. This has led us to formalizing the general **discrete-time stochastic optimal control problem**:
- Environment (discrete time, non-deterministic, non-linear, Markov) $\leftrightarrow$ MDP.
- Behaviour $\leftrightarrow$ control policy $\pi : s\mapsto a$.
- Policy evaluation criterion $\leftrightarrow$ $\gamma$-discounted criterion.
- Goal $\leftrightarrow$ Maximize value function $V^\pi(s)$.

So we have built the first stage of our three-stage rocket.  
The question was "What is the system to control?" and our answer is "The system to control is a Markov Decision Process $\langle S, A, p, r \rangle$ and we will control it with a policy $\pi:s\mapsto a$ in order to optimize $\mathbb{E} \left( \sum_t \gamma^t R_t\right)$".

<div class="alert alert-warning">

**Poll** The limits of MDP modeling   
Can these systems be modeled as MDPs?   
- Playing a tennis video game based on a single video frame
- Playing a tennis video game based on a full physical description of the ball and the players
- The game of Poker
- The collaborative game of [Hanabi](https://en.wikipedia.org/wiki/Hanabi_(card_game))
</div>

<div class="alert alert-warning">
    
**Let's take a short break**
</div>


## Characterizing value functions: the Bellman equations (20 minutes)

- Q functions
- evaluation equation
- optimality equation

## Dynamic Programming for MDPs (30 minutes)

- value iteration
- policy iteration and modified policy iteration
- approximate dynamic programming

**Let's take a short break**

## Learning optimal value functions (30 minutes)

- reminder on stochastic approximation and SGD
- one step of SGD is an approximate resolution of an optimization problem
- eval equation -> TD
- AVI -> QL
- API -> SARSA

## Direct policy optimization (15 minutes)

- max V -> DPS and PG

## Three fundamental challenges in RL (10 minutes)

Function approximation, exploration, optimality.

These challenges are intrinsic to RL.  
But there are countless others, that depend on the context, e.g.:
- Hierarchical RL
- Multi agent RL
- Partially observable MDPs
- Robust RL
- Offline RL
- Transfer in RL

Connection to RLVS classes (map of RLVS)