<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" align="left" src="https://i.creativecommons.org/l/by-nc-sa/4.0/80x15.png" /></a>&nbsp;| [Emmanuel Rachelson](https://personnel.isae-supaero.fr/emmanuel-rachelson?lang=en) | <a href="https://erachelson.github.io/RLclass_MVA/">https://erachelson.github.io/RLclass_MVA/</a>

<div style="font-size:22pt; line-height:25pt; font-weight:bold; text-align:center;">Chapter 0: Reinforcement Learning class introduction; key intuitions</div>

<div class="alert alert-success">

**Learning outcomes:**  
This is the introductory chapter of the RL class. By the end of this chapter you should be able to:
- explain how this class works,
- give a general definition of RL,
- situate RL within the ML landscape,
- use the basic vocabulary of RL: states, observations, actions, transition dynamics, trajectories, values.
</div>

# Foreword <a class="tocSkip">


How this course works (pedagogically):
- a series of notebooks
- no slides
- short exercises along the way
- a bit of live coding
    
What you should expect:
- some notions stated in plain words,
- but no over-simplification,
- a fair amount of (hopefully painless) rigorous notation and abstract concepts,
- most things fully written down, so that you can replay the notebooks on your own.

Color code:
<div class="alert alert-success">Key results in green boxes</div>
<div class="alert alert-warning">Exercises in yellow boxes</div>
<div class="alert alert-danger">Solutions in red boxes</div>

And a first yellow box:

<div class="alert alert-warning">

**Prerequisites:**
- Basic algebra.
- Random variables, probability distributions.
- Gradient descent.
    
**Useful but not compulsory:**
- Random processes, Markov chains (stochastic processes class)
- Notion of contraction mapping.
- Dynamic Programming
- Stochastic Gradient Descent.
</div>

<div class="alert alert-warning">

**Python library requirements**
```
numpy
gymnasium
matplotlib
scikit-learn
pytorch
```
</div>
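
To check your setup before going further, here is a minimal sketch that looks for each of the libraries listed above without actually importing them. The helper name `check_requirements` is ours (hypothetical). Note that two libraries are imported under a different name than the one in the list: scikit-learn as `sklearn` and PyTorch as `torch`.

```python
import importlib.util

# Listed name -> name of the module one actually imports.
REQUIRED = {
    "numpy": "numpy",
    "gymnasium": "gymnasium",
    "matplotlib": "matplotlib",
    "scikit-learn": "sklearn",
    "pytorch": "torch",
}

def check_requirements(required=REQUIRED):
    """Return {listed name: True/False} depending on whether the module can be found."""
    return {name: importlib.util.find_spec(module) is not None
            for name, module in required.items()}

for name, found in check_requirements().items():
    print(f"{name:>12}: {'ok' if found else 'MISSING'}")
```

Anything reported as `MISSING` needs to be installed in the environment that runs the notebooks.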

# Ruining the suspense with a general abstract definition

What is Reinforcement Learning about?

It is about learning to control dynamic systems.
<img src="img/dynamic.png" style="width: 400px;"></img>
Dynamic systems? We mean the **dynamic** evolution of the state $s$ and the observation $o$ under the policy $\pi$, over a certain time horizon.

Our object of study:<br>
We want to find a control policy $\pi$ (with $u = \pi(o)$) such that the system $\Sigma$ behaves as we desire.

Three ideas to keep in mind:
- RL is about learning from data.
- RL is a specific part of optimal control.
- The object we control is a dynamical system.

## Examples of RL problems <a class="tocSkip">


<table>
<tr>
  <td><img src="img/pong.jpg" style="width: 200px;"></td>
  <td style="border-right:1px solid;">Playing a video game</td>
  <td><img src="img/humanoid.jpg" style="width: 200px;"></td>
  <td>Humanoid stand-up</td>
</tr>
<tr>
  <td><img src="img/spiral.jpg" style="width: 200px;"></td>
  <td style="border-right:1px solid;">Exiting a spiral</td>
  <td><img src="img/tests.jpg" style="width: 200px;"></td>
  <td>Dynamic treatment regimes</td>
</tr>
<tr>
  <td><img src="img/pend.png" style="width: 200px;"></td>
  <td style="border-right:1px solid;">Cart-pole balancing</td>
  <td><img src="img/waiting.jpg" style="width: 200px;"></td>
  <td>Queueing problems</td>
</tr>
<tr>
  <td><img src="img/market.jpg" style="width: 200px;"></td>
  <td style="border-right:1px solid;">Portfolio management</td>
  <td><img src="img/dam.jpg" style="width: 200px;"></td>
  <td>Hydroelectric production</td>
</tr>
</table>

But also:
- Elevator scheduling
- Bicycle riding
- Ship steering
- Bioreactor control
- Aerobatic helicopter control (Stanford example)
- Airport departures scheduling
- Ecosystem regulation and preservation
- Robocup soccer
- Video game playing (Atari, Starcraft...)
- Game of Go
- ...

So, learning to play a board game, learning to juggle, learning to take good strategic decisions, learning to drive... all fall into the same category of **control problems** and Reinforcement Learning studies the process of **elaborating a good control strategy through interaction samples**.

<div class="alert alert-success">
    
Reinforcement Learning is about learning an optimal sequential behavior in a given environment.
</div>

Let's break this down.
- sequential behavior in a given environment  
$\rightarrow$ discrete time steps, sequence of actions
- optimal  
$\rightarrow$ a reward signal informs us of the quality of the last action
- learning  
$\rightarrow$ no known model a priori, just interaction samples, behavior adaptation.
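
The three items above can be sketched as an interaction loop. This is a toy illustration under our own assumptions, not the class's reference code: the `Corridor` system, the `random_policy` and the `collect_samples` helper are all hypothetical names, and the only thing the learner gets to see is the stream of `(s, a, r, s')` samples.

```python
import random

class Corridor:
    """Hypothetical toy system: a walker on cells 0..4 that must reach cell 4."""
    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        """Apply action (-1: left, +1: right); return (next_state, reward, done)."""
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0  # the reward signal rates the last action
        return self.state, reward, done

def random_policy(state, rng):
    """A (bad) control policy: ignore the state and pick a direction at random."""
    return rng.choice([-1, +1])

def collect_samples(env, policy, rng, max_steps=100):
    """Run one episode in discrete time steps and record (s, a, r, s') samples."""
    samples = []
    s = env.reset()
    for _ in range(max_steps):
        a = policy(s, rng)
        s_next, r, done = env.step(a)
        samples.append((s, a, r, s_next))
        s = s_next
        if done:
            break
    return samples

rng = random.Random(0)
samples = collect_samples(Corridor(), random_policy, rng)
print(len(samples), "interaction samples; last one:", samples[-1])
```

Learning, in this picture, means using such samples to replace `random_policy` with a better one, without ever reading the code of `Corridor.step`.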

<center><img src="img/dynamic.png" style="width: 400px;"></img></center>

<div class="alert alert-success">

**Keywords:**
- system to control / environment
- control policy
- optimality
</div>

<div class="alert alert-warning">
    
**Warm-up poll:** 
How are you doing today?  
[https://linkto.run/p/BOOR15YA](https://linkto.run/p/BOOR15YA)
- Great, I'm learning RL!
- Great, but I'm scared the RL unicorn will turn into a difficult-to-tame rhino.
- Great, bring the math on (as long as you do it step by step).
- Why do you ask the question if the only answer is "Great"?
</div>

**Standing on the shoulders of giants**

> The idea that we learn by interacting with our environment is probably the first to occur to us when we think about the nature of learning. When an infant plays, waves its arms, or looks about, it has no explicit teacher, but it does have a direct sensorimotor connection to its environment. Exercising this connection produces a wealth of information about cause and effect, about the consequences of actions, and about what to do in order to achieve goals. Throughout our lives, such interactions are undoubtedly a major source of knowledge about our environment and ourselves. Whether we are learning to drive a car or to hold a conversation, we are acutely aware of how our environment responds to what we do, and we seek to influence what happens through our behavior. Learning from interaction is a foundational idea underlying nearly all theories of learning and intelligence. (Sutton & Barto, 2018, [Reinforcement Learning: an Introduction](http://incompleteideas.net/book/the-book-2nd.html))

Caveat: this is a definition of *learning*, not specifically of *reinforcement learning* (although it applies to RL), so it is worth giving some context.

# RL within Machine Learning

You may have had classes on Machine Learning before. There are three strongly distinct categories of problems in ML:
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning

Let's try to answer the following questions for each category.
- What's the abstract problem we are trying to solve?
- What's the data provided to the algorithms?
- Give examples of algorithms in SL/UL/RL.  

<center>
<table border="1">
<tr>
    <td> <b>Question</b> </td>
    <td style="border-left: 1px solid black"> <b>Supervised</b> </td>
    <td style="border-left: 1px solid black"> <b>Unsupervised</b> </td>
    <td style="border-left: 1px solid black"> <b>Reinforcement</b> </td>
</tr>
<tr>
    <td> Target </td>
    <td style="border-left: 1px solid black"> $f(x)=y$, learning functions </td>
    <td style="border-left: 1px solid black"> $x\in X$, looking for sets, clusters, representations </td>
    <td style="border-left: 1px solid black"> $\pi(s)=a$, learning functions too, but from samples of experience rather than from a supervisor's labels </td>
</tr>
<tr>
    <td> Target (rephrased) </td>
    <td style="border-left: 1px solid black"> Predict outputs given inputs</td>
    <td style="border-left: 1px solid black"> Discover structure in data </td>
    <td style="border-left: 1px solid black"> Find an optimal behavior </td>
</tr>
<tr>
    <td> Data </td>
    <td style="border-left: 1px solid black"> $\left\{\left(x,y\right)\right\}$ supervisor's labels </td>
    <td style="border-left: 1px solid black"> $\left\{x\right\}$ unlabelled data </td>
    <td style="border-left: 1px solid black"> $\left\{\left(s,a,r,s'\right)\right\}$ experience samples </td>
</tr>
<tr>
    <td> Output </td>
    <td style="border-left: 1px solid black"> Classifier or regressor</td>
    <td style="border-left: 1px solid black"> Clusters or dimension reduction </td>
    <td style="border-left: 1px solid black"> Policies, value functions </td>
</tr>
<tr>
    <td> Key algorithms </td>
    <td style="border-left: 1px solid black"> Neural networks, SVMs, etc.</td>
    <td style="border-left: 1px solid black"> k-means, PCA, etc. </td>
    <td style="border-left: 1px solid black"> Q-learning, Policy Gradients, etc. </td>
</tr>
</table>
</center>

This table helps distinguish the different natures of the problems tackled. The RL problem is about finding the optimal policy for a given environment.

How is this different from Supervised Learning?
- No correct $(s,a)$ examples are given; we only get $(s,a,r,s')$ samples.
- Rewards can be delayed, which raises the question of credit assignment along trajectories.
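
To see why delayed rewards make credit assignment hard, here is a small sketch. It scores each time step by the plain sum of the rewards collected from that step onwards; this undiscounted sum is our illustrative choice here, and the helper name `rewards_to_go` is hypothetical.

```python
def rewards_to_go(rewards):
    """For each time step t, sum the rewards collected from t onwards."""
    totals = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        totals.append(running)
    return totals[::-1]

# A reward that only arrives at the end of the trajectory (delayed reward).
rewards = [0.0, 0.0, 0.0, 1.0]
print(rewards_to_go(rewards))  # [1.0, 1.0, 1.0, 1.0]
```

The first three actions each received an immediate reward of zero, yet they all lie on the path to the final reward. A supervised learner would be told the right output for each input; here the samples alone do not say which of the four actions deserves the credit.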

<div class="alert alert-warning">
    
**Poll:** Pick the true statement(s).  
[https://linkto.run/p/3OG3IJO3](https://linkto.run/p/3OG3IJO3)
- Sorting new emails as spam (or not) given a million labelled emails is a reinforcement learning task. (No: the reward is immediate, so this boils down to a supervised learning task.)
- Deciding what move to play at chess, based on thousands of previous games, is a reinforcement learning task. (Yes.)
- Incrementally improving the accuracy of a radar detection software from online collected data is a reinforcement learning task. (No: the keyword is *detection*. Incremental improvement is not reinforcement; this is an online supervised learning problem, typically handled by gradient descent, and we are still looking for a static function.)
</div>

Inspirations for RL:
- Control theory and Stochastic processes for the **modeling** part
- Statistics, Optimization and Cognitive Psychology for the **learning** part

# From plain words to first variables

## A medical prescription example

<center><img src="img/patient-doctor.png" style="height: 200px;"></center>
    
A patient walks into a clinic with their medical file (medical history, x-rays, blood work, etc.). You, as their doctor, need to write a prescription. Let us use this example to formalize the process of deciding what to write on the prescription.

## Patient variables

<center>
<img src="img/patient_file.png" style="height: 100px;"> </img> <br>
Patient state now: $S_0$  <br>
Future states: $S_t$
</center>

The medical file of the patient allows us to define a number of variables that characterize the patient now. We will denote by $S_0$ the vector of these variables. Future measurements will be denoted $S_t$.

$S_t$ is a random vector, taking different values in a *patient description space* $S$ at different time steps.

## Prescription

<center>
<img src="img/prescription.png" style="height: 100px;"> </img> <br>
Prescription: $\left( A_t \right)_{t\in\mathbb{N}} = (A_0, A_1, A_2, ...)$
</center>

The prescription is a series of recommendations we give to the patient over the course of treatment. It is thus a sequence $\left( A_t \right)_{t\in\mathbb{N}} = (A_0, A_1, A_2, ...)$ of variables $A_t$.

These treatments $A_t$ are random variables too, taking their value in some space $A$.

## Patient evolution


<center>
<img src="img/patient_evolution.png" style="height: 100px;"> </img> <br>
    $\mathbb{P}(S_t)$?
</center>

The patient evolves over time steps. Their evolution follows a certain probability distribution $\mathbb{P}(S_t)$ over descriptive states.

So $\left( S_t \right)_{t\in\mathbb{N}}$ defines a *random process* that describes the patient's evolution under the influence of past $S_t$ and $A_t$.
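
One way to make "under the influence of past $S_t$ and $A_t$" explicit, offered here as a notational sketch rather than as the definition used in the rest of the class, is to condition the distribution of the next state on everything that happened before:

$$\mathbb{P}\left(S_{t+1} \mid S_0, A_0, \ldots, S_t, A_t\right).$$

Under this reading, the notation $\mathbb{P}(S_t)$ is a shorthand for the law of the patient's state once the history of states and treatments is taken into account.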

## Physician's goal

<center><img src="img/patient_happy.png" style="height: 100px;"> </img> <br></center>

$$J \left( \left(S_t\right)_{t\in \mathbb{N}}, \left( A_t \right)_{t\in \mathbb{N}} \right)?$$

The physician's goal is to bring the patient from an unhealthy state $S_0$ to a healthy situation.  

**This goal is not only defined by a final state of the patient but by the full trajectory followed by the variables $S_t$ and $A_t$. For example, prescribing a drug that damages the patient's liver, or letting the patient experience too much pain over the course of treatment is discouraged.**

We define a criterion $J \left( \left(S_t\right)_{t\in \mathbb{N}}, \left( A_t \right)_{t\in \mathbb{N}} \right)$ that quantifies how good a trajectory in the joint $S\times A$ space is.

## Wrap-up

- Patient state $S_t$, random variable,
- Physician instruction $A_t$, random variable,
- Prescription $\left( A_t \right)_{t\in\mathbb{N}}$, sequence of random variables, random process  
- Patient's evolution $\mathbb{P}(S_t)$,  
- Patient's state trajectory $\left( S_t \right)_{t\in\mathbb{N}}$, random process, 
- Patient's full trajectory $\left( S_t, A_t \right)_{t\in\mathbb{N}}$, random process, 
- Value of a trajectory $J \left( \left(S_t, A_t \right)_{t\in \mathbb{N}} \right)$.  

It seems reasonable that the physician's recommendation $\mathbb{P}(A_t)$ at step $t$ be dependent on previously observed states $\left(S_0, \ldots, S_t\right)$ and recommended treatments $\left(A_0, \ldots, A_{t-1}\right)$.
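
The whole wrap-up fits in a few lines of simulation. Everything below is made up for illustration: the scalar "health" state in $[0,1]$, the two treatments, the dynamics, the physician's rule and the criterion are our assumptions, not a medical model and not the class's definitions. The point is only to show a recommendation $\mathbb{P}(A_t)$ that depends on the observed history, and a criterion $J$ that scores the full trajectory.

```python
import random

ACTIONS = ["rest", "drug"]  # hypothetical treatment space A

def physician_policy(states, actions):
    """Return P(A_t | S_0..S_t, A_0..A_{t-1}) as a dict over ACTIONS.

    Made-up rule: prescribe the drug more readily when health is low,
    but less readily if it was already given at the last two steps.
    """
    health = states[-1]
    p_drug = 0.9 if health < 0.5 else 0.2
    if actions[-2:] == ["drug", "drug"]:
        p_drug *= 0.5
    return {"rest": 1.0 - p_drug, "drug": p_drug}

def patient_step(health, action, rng):
    """Made-up dynamics: sample S_{t+1} given S_t and A_t, clipped to [0, 1]."""
    drift = 0.15 if action == "drug" else -0.05
    return min(1.0, max(0.0, health + drift + rng.gauss(0.0, 0.05)))

def criterion(states, actions):
    """Made-up J: reward health along the whole trajectory, penalize each drug intake."""
    return sum(states) - 0.1 * actions.count("drug")

def simulate(s0, horizon, rng):
    """Sample one trajectory (S_0, A_0, S_1, A_1, ...) over `horizon` steps."""
    states, actions = [s0], []
    for _ in range(horizon):
        probs = physician_policy(states, actions)
        a = rng.choices(ACTIONS, weights=[probs[x] for x in ACTIONS])[0]
        actions.append(a)
        states.append(patient_step(states[-1], a, rng))
    return states, actions

states, actions = simulate(s0=0.2, horizon=10, rng=random.Random(0))
print("J =", round(criterion(states, actions), 3))
```

Running `simulate` several times with different seeds gives different trajectories and different values of $J$: both $S_t$ and $A_t$ are random variables, so the value of a trajectory is random too.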

# Common misconception

You will often see the following type of drawing, along with a sentence like "RL is concerned with the problem of an agent performing actions to control an environment". 

<center><img src="img/misconception.png" style="height: 300px;"></img></center>

Although this sentence is not false *per se*, it conveys an important misconception that may be grounded in too simple anthropomorphic analogies. One often talks about the *state of the agent* or the *state of the environment*. The distinction here is confusing at best: there is no separation between agent and environment. A better vocabulary is to talk about a *system to control*, that is described through its observed *state*. This system is controlled by the application of actions issued from a *policy* or *control law*. The process of *learning* this policy is what RL is concerned with.

Although less shiny, the drawing below may be less misleading.

<center><img src="img/dynamic.png" style="height: 300px;"></img></center>

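A few remarks on this picture:
- The agent is a function: the policy.
- If this function is a recurrent neural network (RNN), the agent carries internal variables of its own.
- In a fully observable system, the observation is the state. Often, however, the observation is only a subset of the internal state, and the problem is then partially observable.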

# Vocabulary

- System to control
- State $S_t$ at time step $t$
- Action $A_t$ at time step $t$
- Observation $O_t$ at time step $t$
- System dynamics $\mathbb{P}(S_t)$
- Trajectory $(S_t, A_t)_{t\in \mathbb{N}}$
- Value of a trajectory $J \left( \left(S_t, A_t \right)_{t\in \mathbb{N}} \right)$ 
- Goal of RL: decide the probability $\mathbb{P}(A_t)$

<div class="alert alert-warning">
    
**Exercise (no poll, 1 minute):**  
Suppose that, instead of treating a patient, we want to learn to swing the pole up in the cart-pole example.  
What are the state description variables?  
What are the action variables?
</div>

A Segway is a real-world system of this kind, for instance.

- System to control
- State $S_t$ at time step $t$: cart position and velocity, pole angle and angular velocity, $(x, \dot{x}, \theta, \dot{\theta})$. The dynamics are first-order differential equations in these variables, which can be integrated numerically (e.g. with a Runge-Kutta scheme).
- Action $A_t$ at time step $t$: push right, push left, or stop (the force applied on the cart)
- Observation $O_t$ at time step $t$: position of the cart
- System dynamics $\mathbb{P}(S_t)$
- Trajectory $(S_t, A_t)_{t\in \mathbb{N}}$
- Value of a trajectory $J \left( \left(S_t, A_t \right)_{t\in \mathbb{N}} \right)$ 
- Goal of RL: decide the probability $\mathbb{P}(A_t)$

<center><img src="img/pend.png" style="width: 300px;"></center>

<details class="alert alert-danger">
    <summary markdown="span"><b>Ready to see the answer? (click to expand)</b></summary>

State: cart position and velocity $x, \dot{x}$, pole angle and velocity $\theta, \dot{\theta}$.
    
Action: force $F$ applied on the cart.
</details>
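
With this answer in hand, the state and action can be written down as code. The sketch below uses the classic frictionless cart-pole equations of motion with a simple explicit Euler integration step; the numerical constants and the names `CartPoleState` and `step` are illustrative assumptions of ours, and $\theta$ is measured from the upright position.

```python
import math
from typing import NamedTuple

class CartPoleState(NamedTuple):
    x: float          # cart position (m)
    x_dot: float      # cart velocity (m/s)
    theta: float      # pole angle from the upright position (rad)
    theta_dot: float  # pole angular velocity (rad/s)

# Illustrative physical constants (assumptions, not taken from the class).
GRAVITY = 9.8      # m/s^2
M_CART = 1.0       # kg
M_POLE = 0.1       # kg
HALF_LENGTH = 0.5  # half of the pole length (m)
DT = 0.02          # integration time step (s)

def step(s: CartPoleState, force: float) -> CartPoleState:
    """Advance the state by one explicit Euler step under the applied force (N)."""
    total_mass = M_CART + M_POLE
    sin_t, cos_t = math.sin(s.theta), math.cos(s.theta)
    temp = (force + M_POLE * HALF_LENGTH * s.theta_dot**2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        HALF_LENGTH * (4.0 / 3.0 - M_POLE * cos_t**2 / total_mass))
    x_acc = temp - M_POLE * HALF_LENGTH * theta_acc * cos_t / total_mass
    return CartPoleState(
        s.x + DT * s.x_dot,
        s.x_dot + DT * x_acc,
        s.theta + DT * s.theta_dot,
        s.theta_dot + DT * theta_acc,
    )

upright = CartPoleState(0.0, 0.0, 0.0, 0.0)
print(step(upright, force=10.0))
```

Pushing the cart to the right from the upright position accelerates the cart forward and tips the pole backward, which is the coupling a control policy has to exploit. A higher-order scheme such as Runge-Kutta would integrate the same equations more accurately.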

# Next steps: three key notions to RL

Understanding RL is a three-stage rocket, answering the questions:  
1. What is the system to control, and how do we model it? As a Markov decision process, or MDP (chapter 1)  
2. What is an optimal strategy? (chapter 2)  
3. How do we learn such a strategy? (chapter 3+)  

# Bibliography

The most cited (and universal) reference for reinforcement learning is the textbook:  
**Reinforcement Learning: an introduction (2nd edition)**, Richard Sutton and Andrew Barto, 2018. MIT Press. [Available online](http://incompleteideas.net/book/the-book.html).

For foundational results on Markov decision processes, an excellent reference is:  
**Markov decision processes: discrete stochastic dynamic programming (2nd edition)**, Martin L. Puterman, 2014. John Wiley & Sons. [Link](https://www.wiley.com/en-us/Markov+Decision+Processes%3A+Discrete+Stochastic+Dynamic+Programming-p-9781118625873).

For a comprehensive view of theoretical foundations of reinforcement learning, these lecture notes are a great source:  
**Theoretical Foundations of Reinforcement Learning**, Csaba Szepesvári, 2020. [Available online](https://rltheory.github.io/).

For deep reinforcement learning, the reader might refer to:  
**An Introduction to Deep Reinforcement Learning**, Vincent Francois-Lavet, Peter Henderson, Riashat Islam, Marc G. Bellemare, Joelle Pineau, 2019. [Available online](https://arxiv.org/abs/1811.12560).

There are many more great books and resources that would deserve citing. Here are a few:  
**Algorithms for Reinforcement Learning**, Csaba Szepesvári, 2010. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4(1), 1-103. [Available online](https://sites.ualberta.ca/~szepesva/rlbook.html).  
**Markov decision processes in artificial intelligence**, Sigaud, O., & Buffet, O. (Eds.), 2013. John Wiley & Sons. [Link](https://www.wiley.com/en-fr/Markov+Decision+Processes+in+Artificial+Intelligence-p-9781848211674).  
**Lectures on Reinforcement Learning**, David Silver, 2015. [Available online](https://www.davidsilver.uk/teaching/).  
**From tabular RL to DQN and DDPG**, Olivier Sigaud, 2019. [Available online](https://www.youtube.com/playlist?list=PLe5mY-Da-ksWV330WbfazLUyOuR59sers).  

Not everything covered in the present class is covered in these books and classes, but they form a great foundation for any RL practitioner. Throughout the class, we will give more detailed and specific references to papers or book chapters to support the claims made, and/or to go further.