# 2 - Introduction to Q-Learning

In this second unit, we are going to dive deeper into one of the Reinforcement Learning approaches (i.e., value-based methods) and study our first RL algorithm: **Q-Learning**.

We will also implement our first RL agent from scratch, a Q-Learning agent, and will train it in two environments:

1. Frozen-Lake-v1 (non-slippery version): where our agent will need to go from the starting state (S) to the goal state (G) by walking only on frozen tiles (F) and avoiding holes (H).
2. An autonomous taxi: where our agent will need to learn to navigate a city to transport its passengers from point A to point B.

<img src="images/envs.gif" title="" alt="" width="600" data-align="center">

Concretely, we will:

* Learn about **value-based methods**.
* Learn about the **differences between Monte Carlo and Temporal Difference Learning**.
* **Study and implement** our first RL algorithm: **Q-Learning**.

## 2.1 - Short recap

As a reminder, there are two main types of RL methods:
    
* <span style="color:blue">Policy-based methods:</span> Train the policy directly to **learn which action to take given a state**
* <span style="color:blue">Value-based methods:</span> Train a value function to **learn which state is more valuable and use this value function to take the action that leads to it**

<img src="images/two-approaches.png" title="" alt="" width="500" data-align="center">

## 2.2 - Two types of value-based methods

In value-based methods, we learn a value function that maps a state to the expected value of being at that state. The value of a state is the expected discounted return the agent can get if it starts at that state and then acts according to our policy.

-----

<span style="color:red"><b>Question:</b></span> But what does it mean to act according to our policy? After all, we don't have a policy in value-based methods since we train a value function and not a policy.

<span style="color:blue"><b>Answer:</b></span> The policy is "latent": we don't learn it directly; **it is the training of the value function that indirectly defines it**. Since the policy itself is not trained, **we need to specify its behavior**, that is, how it turns values into actions. For instance, if we want a policy that, given the value function, always takes the action that leads to the highest value, we'll create a Greedy Policy.

-----

Whatever method we use to solve the problem, we will have a policy. In the case of value-based methods, we don’t train the policy: it is just a simple pre-specified function (for instance, a Greedy Policy) that uses the values given by the value function to select its actions. <span style="color:blue">In deep RL, the value function is usually a <b>neural network</b>, hence the name; in this unit, it will be a simple table</span>.

<img src="images/link-value-policy.png" title="" alt="" width="500" data-align="center">
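To make the link between values and policy concrete, here is a minimal sketch of a greedy policy. The action names and the numbers in `action_values` are made up for illustration; they do not come from any environment in this unit.

```python
# Hypothetical value estimates for the actions available in one state
# (illustrative numbers only).
action_values = {"left": 0.2, "right": 0.7, "up": 0.1, "down": 0.0}


def greedy_policy(values):
    """Return the action with the highest estimated value."""
    return max(values, key=values.get)


best_action = greedy_policy(action_values)
print(best_action)
```

Nothing in `greedy_policy` is trained: all the learning lives in the values it reads, which is what we mean when we say the policy is defined indirectly.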

So, we have two types of value functions:

### 2.2.1 - The state-value function

We write the state value function under a policy $\pi$ like this:

<img src="images/state-value-function-1.png" title="" alt="" width="500" data-align="center">

For each state, the state-value function outputs the expected return if the agent starts at that state and then follows the policy forever afterward (for all future timesteps, if you prefer).

<img src="images/state-value-function-2.png" title="" alt="" width="500" data-align="center">

### 2.2.2 - The action-value function

For each state-action pair, the action-value function outputs the expected return if the agent starts in that state, takes that action, and then follows the policy forever after.

The value of taking action $a$ in state $s$ under a policy $\pi$ is:

<img src="images/action-state-value-function-1.png" title="" alt="" width="500" data-align="center">

<img src="images/action-state-value-function-2.jpg" title="" alt="" width="500" data-align="center">


### 2.2.3 - Differences

We see that the difference is:

* In the state-value function, we calculate **the value of a state** $S_{t}$.
* In the action-value function, we calculate **the value of the state-action pair** $(S_{t}, A_{t})$; i.e., the value of taking that action at that state.

In either case, the returned value is the expected return. 

However, the problem is that **to calculate EACH value of a state or a state-action pair, we need to sum all the rewards an agent can get if it starts at that state or state-action pair**.

This can be a **computationally expensive process**, and that is where the <span style="color:blue"><b>Bellman equation</b></span> comes to help us.

## 2.3 - The Bellman Equation

With what we have learned so far, we know that to calculate $V(S_{t})$ (the value of a state), we need to calculate the return starting at that state and following the policy forever after. So, we need to calculate the sum of the expected rewards:

<img src="images/bellman2.png" title="" alt="" width="500" data-align="center">

The thing is, to calculate $V(S_{t+1})$, we would repeat most of the computation we just did for $V(S_{t})$, and the same goes for every following state. Therefore, instead of doing this repetitive computation, we can use the **Bellman equation** (hint: if you know what <a href="https://en.wikipedia.org/wiki/Dynamic_programming">dynamic programming</a> is, this is very similar).

The Bellman equation is a recursive equation that works like this: instead of starting for each state from the beginning and calculating the return, we can consider the value of any state as **the immediate reward $R_{t+1}$ + the discounted value of the state that follows ($\gamma * V(S_{t+1})$)**. So, we are basically accumulating the "local reward sum" so we can speed up the computational process.

<img src="images/bellman4.png" title="" alt="" width="500" data-align="center">
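As a rough illustration of why the recursion helps, the sketch below computes state values twice: once by summing every discounted reward from each state, and once by working backward with $V(S_{t}) = R_{t+1} + \gamma * V(S_{t+1})$. To keep it short, we assume a single deterministic path (so the expectation disappears), made-up rewards, and $\gamma = 0.9$.

```python
# Illustrative rewards along one fixed, deterministic path (made-up numbers):
# rewards[t] plays the role of R_{t+1}, the reward received when leaving state t.
rewards = [1.0, 0.0, 2.0, 1.0]
gamma = 0.9  # assumed discount rate


def value_by_full_sum(t):
    """Sum every discounted reward from state t to the end of the path."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))


def values_by_bellman():
    """Work backward using V(S_t) = R_{t+1} + gamma * V(S_{t+1})."""
    values = [0.0] * (len(rewards) + 1)  # the terminal state has value 0
    for t in reversed(range(len(rewards))):
        values[t] = rewards[t] + gamma * values[t + 1]
    return values[:-1]  # drop the terminal state


direct = [value_by_full_sum(t) for t in range(len(rewards))]
recursive = values_by_bellman()
print(direct)
print(recursive)
```

Both approaches give the same values, but the backward pass reuses $V(S_{t+1})$ instead of re-summing the tail of the path for every state.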



## 2.4 - Monte Carlo vs Temporal Difference Learning

Monte Carlo and Temporal Difference Learning are two different **strategies for training our value function or our policy function**. Both of them use experience to solve the RL problem.

On one hand, Monte Carlo uses an entire episode of experience before learning. On the other hand, Temporal Difference uses only a step $(S_{t}, A_{t}, R_{t+1}, S_{t+1})$ to learn.

### 2.4.1 - Monte Carlo: learning at the end of the episode

Monte Carlo waits until the end of the episode, calculates $G_{t}$ (return) and uses it as a target for updating $V(S_{t})$. **It requires a complete episode of interaction before updating our value function**.

<img src="images/monte-carlo-approach.jpg" title="" alt="" width="500" data-align="center">

Let's consider the mouse & cheese game as an example:

* We always start the episode at the same starting point 
* The agent takes actions using the policy. For instance, using an Epsilon Greedy Strategy, a policy that alternates between exploration (random actions) and exploitation.
* On each step, we get **the reward and the next state**.
* We terminate the episode if the cat eats the mouse or if the mouse takes more than 10 steps.
* At the end of the episode, we have a list of State, Action, Reward, and Next State tuples. For instance [[State tile 3 bottom, Go Left, +1, State tile 2 bottom], [State tile 2 bottom, Go Left, +0, State tile 1 bottom]…]
* The agent will **sum the total rewards** $G_{t}$ (to see how well it did).
* It will then **update $V(S_{t})$ based on the formula**
* **Start a new game with this new knowledge**

By running more and more episodes, **the agent will learn to play better and better**.

<img src="images/MC-3p.jpg" title="" alt="" width="500" data-align="center">

For instance, if we train a state-value function using Monte Carlo:

<img src="images/MC-4p.jpg" title="" alt="" width="500" data-align="center">

We have a list of (state, action, reward, next state) tuples, and **we need to calculate the return $G_{t}$ from this episode**:

* $G_{t} = R_{t+1} + R_{t+2} + R_{t+3} \dots$ (for simplicity we don't discount the rewards)
* $G_{t} = 1+0+0+0+0+0+1+1+0+0 = 3$

We can now update $V(S_{0})$:

<img src="images/MC-5.png" title="" alt="" width="300" data-align="center">

* New $V(S_{0}) = V(S_{0}) + lr * [G_{t} - V(S_{0})]$
* New $V(S_{0}) = 0 + 0.1 * [3-0] = 0.3$

**Note:** We would repeat this process for all of the states visited during the episode.
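The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the update for a single state; the function name `mc_update` is ours, and the rewards, learning rate, and starting value are the ones from the example.

```python
def mc_update(value, episode_return, lr):
    """Monte Carlo update: move V(S_t) a fraction lr toward the return G_t."""
    return value + lr * (episode_return - value)


# Rewards collected during the example episode.
episode_rewards = [1, 0, 0, 0, 0, 0, 1, 1, 0, 0]
gamma = 1.0  # no discounting, as in the example

# G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
g = sum(gamma ** k * r for k, r in enumerate(episode_rewards))

new_v0 = mc_update(value=0.0, episode_return=g, lr=0.1)
print(g, new_v0)
```

Note that `g` can only be computed once the episode is over, which is exactly why Monte Carlo has to wait before updating.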

### 2.4.2 - Temporal Difference Learning: learning at each step

Temporal difference **waits for only one interaction (one step) $S_{t+1}$ to form a TD target and update $V(S_{t})$ using $R_{t+1}$ and $\gamma * V(S_{t+1})$**.

But, since we didn't experience an entire episode, we don't have $G_{t}$ (the return). Instead, we estimate $G_{t}$ by adding $R_{t+1}$ and the discounted value of the next state. This is called bootstrapping because TD bases its update in part on an existing estimate $V(S_{t+1})$ and not on a complete sample $G_{t}$.

<img src="images/TD-1.jpg" title="" alt="" width="500" data-align="center">

If we take the same example as before (mouse & cheese game):

* We just started to train our value function, so it returns 0 value for each state.
* Our learning rate (lr) is 0.1, and our discount rate is 1 (no discount).
* Our mouse explores the environment and takes a random action: **going to the left**
* It gets a reward $R_{t+1} = 1$ since **it eats a piece of cheese**

<img src="images/TD-1p.jpg" title="" alt="" width="500" data-align="center">

We can now update $V(S_{0})$:

<img src="images/TD-3.png" title="" alt="" width="350" data-align="center">

* New $V(S_{0}) = V(S_{0}) + lr * [R_{1} + \gamma * V(S_{1}) - V(S_{0})]$
* New $V(S_{0}) = 0 +0.1 * [1+1*0-0] = 0.1$

We just updated our value function for state 0. Now we **continue to interact with this environment with our updated value function**.
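The same update written as code, as a minimal sketch: `td_update` is a name we introduce here, and the inputs are the values from the example (all state values start at 0, lr = 0.1, $\gamma = 1$, reward of 1 for the cheese).

```python
def td_update(v_s, reward, v_next, lr, gamma):
    """TD(0) update: move V(S_t) toward the TD target R_{t+1} + gamma * V(S_{t+1})."""
    td_target = reward + gamma * v_next
    return v_s + lr * (td_target - v_s)


# One step of the mouse & cheese example.
new_v0 = td_update(v_s=0.0, reward=1.0, v_next=0.0, lr=0.1, gamma=1.0)
print(new_v0)
```

Compared with the Monte Carlo update, the only change is the target: the full return $G_{t}$ is replaced by an estimate built from one reward and the current value of the next state.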

## 2.5 - Mid-way recap

We have two types of value functions:
    
* <span style="color:blue"><b>State-value function:</b></span> outputs the expected return if the agent starts at a given state and acts according to the policy forever after.
* <span style="color:blue"><b>Action-value function:</b></span> outputs the expected return if the agent starts in a given state, takes a given action at that state and then acts according to the policy forever after.

In value-based methods, rather than learning the policy, **we define the policy by hand and we learn a value function**. **If we have an optimal value function, we will have an optimal policy**.

There are two types of methods to learn a value function:

* With the *Monte Carlo method*, we update the value function from a complete episode, so we use the actual discounted return of this episode.
* With the *TD Learning method*, we update the value function from a step, so we replace $G_{t}$ that we don't have with an estimated return called TD target.

<img src="images/summary-learning-mtds.jpg" title="" alt="" width="500" data-align="center">


## 2.6 - Introducing Q-Learning

### 2.6.1 - What is Q-Learning

Q-Learning is an off-policy value-based method that uses a TD approach to train its action-value function. **Q-learning is the algorithm we use to train our Q-function**, an **action-value** function that determines the value of being at a particular state and taking a specific action at that state.

<img src="images/Q-function.png" title="" alt="" width="500" data-align="center">

Given a state and action, our Q-function outputs a state-action value (also called Q-value).

The **Q comes from "the Quality" (the value) of that action at that state**. Let's quickly recap the difference between value and reward:

* The value of a state, or a state-action pair, is the expected cumulative reward our agent gets if it starts at this state (or state-action pair) and then acts according to its policy.
* The reward is the feedback the agent gets from the environment after performing an action at a state.

**Internally, our Q-function has a Q-table, a table where each cell corresponds to a state-action pair value.** <span style="color:Blue"><b>Think of this Q-table as the memory or cheat-sheet of our Q-function.</b></span> Given a state and action, our Q-function will search inside its Q-table to output the value.

Consider the following "maze" and Q-table as an example:

<img src="images/Maze-3.png" title="" alt="" width="500" data-align="center">

If we recap, <span style="color:Blue">Q-Learning</span> is the RL algorithm that:

* Trains a Q-function (an **action-value function**), which internally is a **Q-table that contains all the state-action pair values**.
* Given a state and action, our Q-function will **look up the corresponding value in its Q-table**.
* **When the training is done, we have an optimal Q-function, which means we have an optimal Q-table**.
* And if we have an optimal Q-function, **we have an optimal policy since we know for each state what is the best action to take**.

But, in the beginning, our Q-table is useless since it gives arbitrary values for each state-action pair (most of the time, we initialize the Q-table to 0). As the agent explores the environment and we update the Q-table, it will give us better and better approximations to the optimal policy:

<img src="images/Q-learning-1.png" title="" alt="" width="500" data-align="center">


### 2.6.2 - The Q-Learning algorithm

Now that we understand what Q-Learning, Q-function, and Q-table are, let’s dive deeper into the Q-Learning algorithm.

<img src="images/Q-learning-2.png" title="" alt="" width="550" data-align="center">

#### Step 1: Initialize the Q-table (usually with values of 0)

<img src="images/Q-learning-3.png" title="" alt="" width="550" data-align="center">
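A minimal sketch of this step, using plain Python lists to stay dependency-free (a NumPy array would be the more usual choice). The helper name `initialize_q_table` is ours; the sizes correspond to the 4x4 FrozenLake map, which has 16 states and 4 actions.

```python
def initialize_q_table(n_states, n_actions):
    """Create a Q-table with one row per state and one column per action, all zeros."""
    return [[0.0] * n_actions for _ in range(n_states)]


# The 4x4 FrozenLake map has 16 states and 4 actions (left, down, right, up).
q_table = initialize_q_table(n_states=16, n_actions=4)
print(len(q_table), len(q_table[0]))
```

Looking up a Q-value is then just `q_table[state][action]`.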

#### Step 2: Choose an action using epsilon-greedy strategy

<img src="images/Q-learning-4.png" title="" alt="" width="550" data-align="center">

The idea is that we define the initial epsilon $\epsilon = 1.0$:

* With probability $1-\epsilon$: we do **exploitation** (aka our agent selects the action with the highest state-action pair value).
* With probability $\epsilon$: we do **exploration** (aka our agent tries a random action).

At the beginning of the training, **the probability of doing exploration will be huge since $\epsilon$ is very high, so most of the time, we'll explore.** But as the training goes on, and consequently our **Q-table gets better and better in its estimations, we progressively reduce the epsilon value** since we will need less and less exploration and more exploitation.

<img src="images/Q-learning-5.png" title="" alt="" width="200" data-align="center">
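Here is one way epsilon-greedy selection and epsilon decay might look in code. The exponential schedule and its parameters (`max_eps`, `min_eps`, `decay_rate`) are assumptions chosen for illustration; other schedules, such as a linear decay, work too.

```python
import math
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action index from the Q-values of the current state.

    With probability epsilon we explore (uniformly random action);
    otherwise we exploit (action with the highest Q-value).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=q_row.__getitem__)


def decayed_epsilon(episode, max_eps=1.0, min_eps=0.05, decay_rate=0.0005):
    """Assumed schedule: exponential decay from max_eps toward min_eps."""
    return min_eps + (max_eps - min_eps) * math.exp(-decay_rate * episode)


q_row = [0.1, 0.9, 0.3]  # illustrative Q-values for one state
print(epsilon_greedy(q_row, epsilon=0.0))  # epsilon = 0: always exploit
print(decayed_epsilon(0), decayed_epsilon(10_000))
```

With this schedule the agent explores almost all the time in early episodes and settles near `min_eps` later, so it never stops exploring completely.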

#### Step 3: Perform action $A_{t}$, get reward $R_{t+1}$ and next state $S_{t+1}$

<img src="images/Q-learning-6.png" title="" alt="" width="500" data-align="center">

#### Step 4: Update $Q(S_{t}, A_{t})$

Remember that in TD learning, we update our policy or value function (depending on the RL method we choose) **after one step of the interaction**. To produce our TD target, **we use the immediate reward $R_{t+1}$ plus the discounted value of the best next state-action pair** (we call that bootstrapping). Therefore, our $Q(S_{t}, A_{t})$ update formula goes like this:

<img src="images/Q-learning-8.png" title="" alt="" width="500" data-align="center">

This means that to update our $Q(S_{t}, A_{t})$:
* We need $S_{t}, A_{t}, R_{t+1}, S_{t+1}$
* To update our Q-value at a given state-action pair, we use the TD target

How do we form the TD target?

1. We obtain the reward $R_{t+1}$ after taking the action.
2. To get the **best next-state-action pair value**, we use a greedy policy to select the next best action. Note that this is not an epsilon-greedy policy: it always takes the action with the highest state-action value.

Then, when the update of this Q-value is done, we start in a new state and select our action **using an epsilon-greedy policy again**. This is why we say **Q-learning is an off-policy algorithm**.
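The update step can be sketched as follows. The function name `q_learning_update`, the tiny 2x2 table, and the numbers are ours, picked so the result is easy to check by hand.

```python
def q_learning_update(q_table, state, action, reward, next_state, lr, gamma):
    """Apply one Q-learning update in place and return the new Q(S_t, A_t)."""
    # Greedy bootstrap: value of the best action available in the next state.
    best_next_value = max(q_table[next_state])
    td_target = reward + gamma * best_next_value
    q_table[state][action] += lr * (td_target - q_table[state][action])
    return q_table[state][action]


# Tiny illustrative table: 2 states x 2 actions (made-up values).
q_table = [[0.0, 0.0],
           [0.0, 2.0]]
new_q = q_learning_update(q_table, state=0, action=1, reward=1.0,
                          next_state=1, lr=0.1, gamma=0.99)
print(new_q)
```

Here the TD target is 1 + 0.99 * 2 = 2.98, so the Q-value moves from 0 to 0.1 * 2.98 = 0.298. The `max` over the next state's row is the greedy part of the update, independent of how the agent actually picks its next action.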

### 2.6.3 - Off-policy vs On-policy


* Off-policy. **Using a different policy for acting (inference) and updating (training)**. For instance, with Q-learning, the epsilon-greedy policy (acting policy) is different from the greedy policy that is used to select the best next-state action value to update our Q-value (updating policy).

* On-policy. **Using the same policy for acting and updating**. For instance, with [Sarsa](https://en.wikipedia.org/wiki/State%E2%80%93action%E2%80%93reward%E2%80%93state%E2%80%93action), another value-based algorithm, the **epsilon-greedy policy selects the next state-action pair, not a greedy policy.**

<img src="images/off-on-4.png" title="" alt="" width="600" data-align="center">
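The difference shows up in a single line of the TD target. The sketch below compares the two targets on made-up numbers; the function names are ours, and `next_action` stands for whatever action the epsilon-greedy acting policy happened to pick in the next state.

```python
def q_learning_target(q_table, reward, next_state, gamma):
    """Off-policy target: bootstrap from the greedy (best) next action."""
    return reward + gamma * max(q_table[next_state])


def sarsa_target(q_table, reward, next_state, next_action, gamma):
    """On-policy target: bootstrap from the action the acting policy actually chose."""
    return reward + gamma * q_table[next_state][next_action]


# Illustrative Q-values: in state 1, action 1 is the greedy choice.
q_table = [[0.0, 0.0],
           [0.5, 2.0]]
next_action = 0  # suppose epsilon-greedy explored and picked the non-greedy action

off_policy = q_learning_target(q_table, reward=0.0, next_state=1, gamma=0.9)
on_policy = sarsa_target(q_table, reward=0.0, next_state=1,
                         next_action=next_action, gamma=0.9)
print(off_policy, on_policy)
```

When the acting policy explores, the two targets differ (1.8 versus 0.45 here); when it happens to act greedily, they coincide.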


## 2.7 - A Q-Learning example