# 1 - Introduction to Deep Reinforcement Learning

Deep RL is a type of Machine Learning where an agent learns how to behave in an environment by performing actions and seeing the results.

In this first unit, you will:

1. Learn the foundations of Deep Reinforcement Learning.

2. Train your Deep Reinforcement Learning agent, a lunar lander, to land correctly on the Moon using [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/), a Deep Reinforcement Learning library.

3. Upload this trained agent to the Hugging Face Hub 🤗, a free, open platform where people can share ML models, datasets, and demos.

It’s essential to master these elements before diving into implementing Deep Reinforcement Learning agents. The goal of this chapter is to give you solid foundations.

After this unit, in a bonus unit, you’ll be able to train Huggy the Dog 🐶 to fetch the stick and play with him 🤗.

## 1.1 - What is reinforcement learning?

### 1.1.1 - The big picture

The idea behind Reinforcement Learning is that an **agent** (an AI) will learn from the environment by **interacting with it** (through trial and error) and receiving **rewards** (negative or positive) as feedback for performing actions.

Learning from interactions with the environment comes from our natural experiences.

For instance, imagine putting your little brother in front of a video game he has never played, giving him a controller, and leaving him alone.

Your brother will interact with the environment (the video game) by pressing the right button (action). He gets a coin: that’s a +1 reward. It’s positive, so he understands that in this game he must get the coins.

<img src="images/Illustration_2.PNG" title="" alt="" width="500" data-align="center">

But then he presses right again and touches an enemy. He dies, so that’s a -1 reward.

<img src="images/Illustration_3.PNG" title="" alt="" width="500" data-align="center">

By interacting with his environment through trial and error, your little brother understood that he needed to get coins in this environment but avoid the enemies.

Without any supervision, the child will get better and better at playing the game.

That’s how humans and animals learn: through interaction. Reinforcement Learning is just a computational approach to learning from actions.

### 1.1.2 - A formal definition

Reinforcement learning is a framework for solving control tasks (also called decision problems) by building agents that learn from the environment by interacting with it through trial and error and receiving rewards (positive or negative) as their only feedback.

## 1.2 - The reinforcement learning framework

### 1.2.1 - The RL process

The RL process is a loop of state, action, reward, and next state:

<img src="images/RL_process.jpg" title="" alt="" width="550" data-align="center">

To understand the RL process, let’s imagine an agent learning to play a platform game:

<img src="images/RL_process_game.jpg" title="" alt="" width="550" data-align="center">

* Our Agent receives state $S_{0}$ from the **Environment** — we receive the first frame of our game (Environment).
* Based on that **state** $S_{0}$, the Agent takes action $A_{0}$ — our Agent will move to the right.
* Environment goes to a **new state** $S_{1}$ — new frame.
* The environment gives some **reward** $R_{1}$ to the Agent — we’re not dead (`Positive Reward` +1).

This RL loop outputs a sequence of state, action, reward and next state.
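
The loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ToyPlatformEnv`, its `reset` and `step` methods, the `always_right` policy, and the +1 reward per step are hypothetical names and values chosen for this sketch, not the API of any particular library.

```python
class ToyPlatformEnv:
    """Hypothetical 1-D platform level: start at tile 0, finish at tile `goal`."""

    def __init__(self, goal=3):
        self.goal = goal
        self.position = 0

    def reset(self):
        """Start a new game and return the initial state S_0."""
        self.position = 0
        return self.position

    def step(self, action):
        """Apply an action (+1 = right, -1 = left); return (next_state, reward, done)."""
        self.position = max(0, self.position + action)
        done = self.position >= self.goal
        reward = 1  # assumption: +1 for every step in which we are not dead
        return self.position, reward, done


def run_episode(env, policy):
    """Collect (state, action, reward, next_state) tuples until the game ends."""
    trajectory = []
    state = env.reset()
    done = False
    while not done:
        action = policy(state)                       # agent picks A_t from S_t
        next_state, reward, done = env.step(action)  # environment returns S_{t+1}, R_{t+1}
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory


def always_right(state):
    """A trivial policy: move right whatever the state is."""
    return 1


trajectory = run_episode(ToyPlatformEnv(goal=3), always_right)
print(trajectory)
```

Each tuple in `trajectory` is one pass through the loop: the state the agent saw, the action it took, the reward it received, and the state it ended up in.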

**The agent's goal is to maximize its expected cumulative reward, called the expected return**.

### 1.2.2 - The reward hypothesis

RL is based on the **reward hypothesis**, which is that all goals can be described as the **maximization of the expected return** (expected cumulative reward).

That’s why in Reinforcement Learning, **to have the best behavior**, we aim to learn to take actions that **maximize the expected cumulative reward**.

### 1.2.3 - Markov property

The RL process is formally described as a **Markov Decision Process** (MDP).

The Markov Property implies that our agent needs **only the current state to decide** what action to take and **not the history of all the states and actions** it took before.

### 1.2.4 - Observations / States space

Observations/States are the **information our agent gets from the environment**. In the case of a video game, it can be a frame (a screenshot). In the case of a trading agent, it can be the value of a certain stock, etc.

There is a differentiation to make between *observation* and *state*, however:

* *State* $S$ is a complete description of the state of the world (there is no hidden information), as in a fully observed environment. For example, in a chess game, we receive a state from the environment since **we have access to the whole chessboard**.

* *Observation* $O$ is a partial description of the state, as in a partially observed environment. For example, in Super Mario Bros, we receive an observation since **we only see a part of the level**.

<img src="images/obs_space_recap.jpg" title="" alt="" width="500" data-align="center">

----

**Note:** In this course, we use the term "state" to denote both state and observation, but we will make the distinction in implementations.

----
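
The difference between a state and an observation can be sketched as follows. The level layout, the `view_radius` parameter, and the function names are all made up for illustration; real environments define their own observation format.

```python
def full_state(level):
    """Fully observed: the agent receives a description of the whole level."""
    return list(level)


def partial_observation(level, agent_index, view_radius=1):
    """Partially observed: the agent only sees tiles within `view_radius` of itself."""
    start = max(0, agent_index - view_radius)
    end = min(len(level), agent_index + view_radius + 1)
    return level[start:end]


# Hypothetical level, one label per tile.
level = ["start", "empty", "coin", "enemy", "empty", "flag"]

state = full_state(level)                                    # everything is visible
observation = partial_observation(level, agent_index=2)      # only nearby tiles
print(state)
print(observation)
```

With the agent on tile 2, the observation hides the flag at the end of the level, while the state includes it.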

### 1.2.5 - Action space

The action space is the set of all possible actions in the environment. The actions can come from a discrete or continuous space.

* *Discrete space*: the number of possible actions is finite. For example, in Super Mario Bros, we only have 5 possible actions: 4 directions and jumping.

* *Continuous space*: the number of possible actions is infinite. For example, a self-driving car agent has an infinite number of possible actions since it can turn left 20°, 21.1°, 21.2°, honk, turn right 20°, etc.

<img src="images/action_space.jpg" title="" alt="" width="500" data-align="center">
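
As a rough sketch of the two kinds of space, a discrete space can be represented as a finite list and a continuous one as a bounded interval of real numbers. The action labels and the steering bounds below are assumptions made for this example.

```python
import random

# Discrete space: a finite set of actions (labels are illustrative).
DISCRETE_ACTIONS = ["left", "right", "up", "down", "jump"]

# Continuous space: any steering angle in degrees within assumed bounds.
STEERING_LOW, STEERING_HIGH = -30.0, 30.0


def sample_discrete(rng):
    """Pick one of the finitely many actions."""
    return rng.choice(DISCRETE_ACTIONS)


def sample_continuous(rng):
    """Pick any real-valued angle inside the interval."""
    return rng.uniform(STEERING_LOW, STEERING_HIGH)


rng = random.Random(0)  # seeded so the sketch is reproducible
discrete_action = sample_discrete(rng)
continuous_action = sample_continuous(rng)
print(discrete_action, continuous_action)
```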

### 1.2.6 - Rewards and the discounting

The reward is fundamental in RL because it is **the only feedback** for the agent. Thanks to it, our agent knows **if the action taken was good or not**.

The cumulative reward at each time step $t$ can be written as:

<img src="images/rewards_1.png" title="" alt="" width="350" data-align="center">

which is equivalent to:

$$
R(\tau) = \sum_{k=0}^{\infty} r_{t+k+1}
$$

However, in reality, **we can't just add them like that**. The rewards that come sooner (at the beginning of the game) **are more likely to happen** since they are more predictable than the long-term future reward.

Let's say your agent is this tiny mouse that can move one tile each time step, and your opponent is the cat (which can move too). The mouse's goal is **to eat the maximum amount of cheese before being eaten by the cat**.

As we can see in the diagram, **it's more probable to eat the cheese near us than the cheese close to the cat** (the closer we are to the cat, the more dangerous it is).

Consequently, **the reward near the cat, even if it is bigger (more cheese), will be more discounted** since we are not really sure we will be able to eat it.

To discount the rewards, we proceed like this:

1. We define a discount rate called $\gamma$. **It must be between 0 and 1**. Most of the time, it is between **0.95 and 0.99**.
    * The larger the $\gamma$, the smaller the discount. This means our agent cares more about the long-term reward.
    * The smaller the $\gamma$, the bigger the discount. This means our agent cares more about the short-term reward (the nearest cheese).

2. Then, each reward is discounted by $\gamma$ raised to the power of the time step. As the time step increases, the cat gets closer to us, so the future reward is less and less likely to happen.

Our discounted expected cumulative reward is:

<img src="images/rewards_4.png" title="" alt="" width="400" data-align="center">
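
In symbols, the discounted sum is $R(\tau) = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}$. For a finite list of rewards, it can be computed with a short helper. The function name `discounted_return` and the sample rewards are assumptions for this sketch.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], where rewards[k] stands for r_{t+k+1}."""
    return sum((gamma ** k) * reward for k, reward in enumerate(rewards))


rewards = [1, 1, 1]  # hypothetical rewards for three consecutive steps

undiscounted = discounted_return(rewards, gamma=1.0)  # plain sum: 1 + 1 + 1
discounted = discounted_return(rewards, gamma=0.5)    # 1 + 0.5 + 0.25
print(undiscounted, discounted)
```

With $\gamma = 0.5$, the third reward only counts for a quarter of its value, which is the "short-term" behavior described above. With $\gamma$ close to 1, later rewards keep almost all of their weight.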

## 1.3 - The type of tasks

A task is an instance of an RL problem. We can have two types of tasks: **episodic** and **continuing**.

### 1.3.1 - Episodic task

In this case, **we have a starting point and an ending point** (a terminal state). This creates an episode: a list of States, Actions, Rewards, and new States. **After each episode, the agent can learn how to choose the best actions based on its previous experience**.

For instance, think about Super Mario Bros: **an episode begins at the launch of a new Mario level and ends when you are killed or you reach the end of the level**.

### 1.3.2 - Continuing task

These are tasks that **continue forever** (no terminal state). In this case, the agent must **learn how to choose the best actions and simultaneously interact with the environment**.

For instance, consider an agent that does automated stock trading. For this task, there is no starting point and no terminal state. **The agent keeps running until we decide to stop it**.

<img src="images/tasks.png" title="" alt="" width="500" data-align="center">

## 1.4 - The exploration/exploitation trade-off

Finally, before looking at the different methods to solve Reinforcement Learning problems, we must cover one more very important topic: *the exploration/exploitation trade-off*.

* *Exploration* is exploring the environment by trying random actions in order to **find more information about the environment**.
* *Exploitation* is **exploiting known information to maximize the reward**.

Remember, the goal of our RL agent is to maximize the expected cumulative reward. However, **we can fall into a common trap**.

In the following game, our mouse can have an **infinite amount of small cheese** (+1 each). But at the top of the maze, there is a gigantic sum of cheese (+1000).

<img src="images/exp_1.jpg" title="" alt="" width="400" data-align="center">

If we only focus on exploitation, our agent will never reach the gigantic sum of cheese. Instead, it will only exploit the **nearest source of rewards**, even if this source is small (exploitation). But if our agent does a little bit of exploration, it can **discover the big reward** (the big pile of cheese).

This is what we call the exploration/exploitation trade-off. We need to balance how much we explore the environment and how much we exploit what we know about the environment. Therefore, we must define a rule that helps to handle this trade-off. We’ll see the different ways to handle it in future units.

<img src="images/expexpltradeoff.jpg" title="" alt="" width="500" data-align="center">
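
One common rule for this trade-off is epsilon-greedy action selection. This unit does not introduce it, so the sketch below is an illustration rather than the method used later in the course. The function name, the value estimates, and the choice of `epsilon` are assumptions for this example.

```python
import random


def epsilon_greedy(estimated_values, epsilon, rng):
    """With probability epsilon pick a random action, else the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimated_values))  # explore
    # Exploit: index of the action with the highest estimated value.
    return max(range(len(estimated_values)), key=lambda a: estimated_values[a])


# Hypothetical estimates: action 0 = nearby small cheese, action 1 = unexplored path.
estimated_values = [1.0, 0.0]
rng = random.Random(42)  # seeded so the sketch is reproducible

# epsilon = 0: pure exploitation, the unexplored path is never tried.
greedy_choices = [epsilon_greedy(estimated_values, 0.0, rng) for _ in range(100)]

# epsilon = 0.5: half of the decisions are random, so the other path gets tried.
mixed_choices = [epsilon_greedy(estimated_values, 0.5, rng) for _ in range(100)]

print(set(greedy_choices), set(mixed_choices))
```

With `epsilon = 0`, the agent behaves like the mouse that only eats the nearest cheese. A non-zero `epsilon` gives it a chance to find out what the other path is worth.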

## 1.5 - Two main approaches for solving RL problems

**How do we build an RL agent that can select the actions that maximize its expected cumulative reward?**

### 1.5.1 - The policy $\pi$: the agent's brain

The policy $\pi$ **is the function that tells us what action to take given the state we are in**. So it defines the agent's behavior at a given time.

<img src="images/policy_1.png" title="" alt="" width="450" data-align="center">

This policy **is the function we want to learn**. Our goal is to find the optimal policy $\pi^{*}$, the policy that maximizes the expected return when the agent acts according to it. We find this $\pi^{*}$ **through training**.

There are two approaches to train our agent to find this optimal policy $\pi^{*}$:

* Directly, **teach the agent to learn which action to take**, given the current state: **Policy-based methods**.
* Indirectly, **teach the agent which state is more valuable** and then take the action that leads to the more valuable states: **Value-based methods**.

### 1.5.2 - Policy-based methods

In policy-based methods, **we learn a policy function directly**.

This function will define a mapping between each state and the best corresponding action. We can also say that **it'll define a probability distribution over the set of possible actions at that state**. 

We have two types of policies:

* **Deterministic:** a policy at a given state will always return the same action. $a = \pi(s)$

* **Stochastic:** outputs a probability distribution over actions. $\pi(a|s) = P(A|s)$ i.e., probability distribution over the set of actions given the state.

 <table>
  <tr>
    <th>Deterministic</th>
    <th>Stochastic</th>
  </tr>
  <tr>
    <td><img src="images/pbm_1.jpg" title="" alt="" width="450" data-align="center"></td>
    <td><img src="images/pbm_2.jpg" title="" alt="" width="450" data-align="center"></td>
  </tr>
</table> 
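
The two types of policies can be sketched as two small functions. The states, actions, and probabilities below are invented for illustration; in practice, the policy is learned rather than written by hand.

```python
import random

ACTIONS = ["left", "right", "jump"]


def deterministic_policy(state):
    """a = pi(s): the same state always maps to the same action."""
    return "right" if state < 3 else "jump"


def stochastic_policy(state):
    """pi(a|s): return a probability for each action in this state."""
    if state < 3:
        return {"left": 0.1, "right": 0.7, "jump": 0.2}
    return {"left": 0.0, "right": 0.2, "jump": 0.8}


def sample_action(action_probs, rng):
    """Draw one action according to the given distribution."""
    actions = list(action_probs)
    weights = list(action_probs.values())
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so the sketch is reproducible
fixed_action = deterministic_policy(state=1)
sampled_action = sample_action(stochastic_policy(state=1), rng)
print(fixed_action, sampled_action)
```

Calling `deterministic_policy(1)` twice always gives the same action, whereas repeated calls to `sample_action` for the same state can return different actions.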

### 1.5.3 - Value-based methods

In value-based methods, instead of training a policy function, **we train a value function that maps a state to the expected value of being at that state**.

The value of a state is the expected discounted return the agent can get if it starts in that state and then acts according to our policy, where "acts according to our policy" just means that our policy is "going to the state with the highest value".

<img src="images/value_1.png" title="" alt="" width="450" data-align="center">

<img src="images/value_2.jpg" title="" alt="" width="450" data-align="center">

Here we can see that **our value function defines a value for each possible state**. Thanks to this function, at each step our policy will select the state with the biggest value defined by the value function: -7, then -6, then -5 (and so on) to attain the goal.
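
The idea of "going to the state with the highest value" can be sketched on a toy corridor. The layout, the `state_values` table, and the helper names are assumptions for this example; only the values -7, -6, -5, and so on echo the figure above.

```python
# Hypothetical 1-D corridor: values increase as we get closer to the goal (value 0).
state_values = {0: -7, 1: -6, 2: -5, 3: -4, 4: -3, 5: -2, 6: -1, 7: 0}
GOAL = 7


def neighbors(state):
    """States reachable in one step in this corridor."""
    return [s for s in (state - 1, state + 1) if s in state_values]


def greedy_step(state):
    """Move to the neighboring state with the highest value."""
    return max(neighbors(state), key=lambda s: state_values[s])


# Follow the greedy rule from the leftmost state until the goal is reached.
path = [0]
while path[-1] != GOAL:
    path.append(greedy_step(path[-1]))

visited_values = [state_values[s] for s in path]
print(path)
print(visited_values)
```

Here the policy is not learned separately: it falls out of the value function, since each move simply picks the most valuable neighbor.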

## 1.6 - The "Deep" in Reinforcement Learning

Deep RL introduces deep neural networks to solve RL problems - hence the name "deep".

For instance, in the next unit, we'll learn about two value-based algorithms:

* Q-Learning (classic RL). We use a traditional algorithm to create a Q table that helps us find what action to take for each state.
* Deep Q-Learning. We use a neural network to approximate the Q value.

<img src="images/deep.jpg" title="" alt="" width="550" data-align="center">
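
The contrast between a Q table and a function that approximates the Q value can be sketched as follows. All numbers are made up, and a simple linear function stands in for the neural network to keep the example dependency-free, so this is not how Deep Q-Learning is actually implemented.

```python
ACTIONS = ["left", "right"]

# Tabular case: one stored number per (state, action) pair. Values are made up.
q_table = {
    (0, "left"): 0.0, (0, "right"): 0.5,
    (1, "left"): 0.1, (1, "right"): 0.9,
}


def q_from_table(state, action):
    """Look the Q value up in the table."""
    return q_table[(state, action)]


# Approximate case: a parametric function computes Q from the state.
# Assumption: a linear function replaces the neural network; weights are made up.
weights = {"left": (0.1, 0.0), "right": (0.4, 0.5)}  # (slope, bias) per action


def q_from_function(state, action):
    """Compute the Q value instead of looking it up."""
    slope, bias = weights[action]
    return slope * state + bias


def best_action(q_function, state):
    """Greedy choice: the action with the highest Q value in this state."""
    return max(ACTIONS, key=lambda a: q_function(state, a))


table_choice = best_action(q_from_table, 1)
function_choice = best_action(q_from_function, 5)  # state 5 has no table entry
print(table_choice, function_choice)
```

The table can only answer for states it has stored, while the function produces a value for any state it is given. That is why an approximator becomes useful when the number of states is too large to enumerate.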


## 1.7 - Additional readings

### 1.7.1 - Deep RL

* [Reinforcement Learning: An Introduction, Richard Sutton and Andrew G. Barto, Chapters 1, 2 and 3](http://incompleteideas.net/book/RLbook2020.pdf)
* [Foundations of Deep RL Series, L1 MDPs, Exact Solution Methods, Max-ent RL by Pieter Abbeel](https://www.youtube.com/watch?v=2GwBez0D20A&feature=youtu.be)
* [Spinning Up RL by OpenAI Part 1: Key concepts of RL](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html)
* [A (long) peek into reinforcement learning](https://lilianweng.github.io/posts/2018-02-19-rl-overview/)

### 1.7.2 - OpenAI Gym

* [Getting Started With OpenAI Gym: The Basic Building Blocks](https://blog.paperspace.com/getting-started-with-openai-gym/)
* [Make your own Gym custom environment](https://www.gymlibrary.dev/content/environment_creation/)