---
title: "RL Unit 2: Introduction to Q Learning"
description: "Unit 2 Learnings from Hugging Face RL Course"
format:
    html:
        code-fold: true
render-on-save: true
execute:
    eval: false
    echo: true
jupyter: python3
output:
  quarto::html_document:
    self_contained: false
    keep_md: false

categories:
    - Re-inforcement Learning
    - Regression Project
image: ./images/RL2_QLearning.jpg
---

## Chapter 2: INTRODUCTION TO Q-Learning

- In the previous unit, we learned about Reinforcement Learning, the RL process, and the different methods for solving an RL problem.
- In this unit, we will:
    - Learn about value-based methods
    - Learn the difference between Monte Carlo and Temporal Difference learning
    - Study and implement our first RL algorithm: Q-Learning

### RL Recap

- The goal of RL is to build an agent that can make smart decisions.
- The agent learns to make smart decisions by interacting with the environment through trial and error and receiving rewards as its only feedback.
- Its goal is to maximize its expected cumulative reward.
- To do that, we need to train the agent's brain, i.e. the policy.

<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/policy.jpg" width = 600 height = 600>

- Our goal therefore shifts from maximizing the expected cumulative reward to learning a policy that maximizes the expected cumulative reward for us.
- There are 2 ways to do this:
    - Policy-based methods: Train the policy directly to learn which action to take given a state.
    - Value-based methods: Train a value function to learn which states are more valuable, and use this value function to take the action that leads to them.

- We will be focusing on value-based methods in this unit.

<img src = 'https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/two-approaches.jpg' height = 600 width = 600>

### Value-based methods

<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/vbm-1.jpg" width = 600 height = 600>

- An RL agent's goal is to have an optimal policy $\pi^*$.
- To find this policy we have 2 methods:
    - Policy-based methods: Train the policy directly; here we don't need any value function.
        - We don't define the behavior of our policy by hand; it's the training that will define it.

<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/two-approaches-2.jpg" width = 600 height = 600>

- Value-based methods: Find the policy indirectly, by training a value function that outputs the value of a state or a state-action pair.
    - Given this value function, our policy will take an action.
    - Since the policy is not trained/learned, we need to specify its behavior by hand.
    - For instance, if we want a policy that, given the value function, always takes the action that leads to the biggest reward, we'll have a Greedy Policy.

<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/two-approaches-3.jpg" width = 600 height = 600>

- Consequently, whatever method you use to solve your problem, you will have a policy. In the case of value-based methods, you don’t train the policy: your policy is just a simple pre-specified function (for instance, the Greedy Policy) that uses the values given by the value-function to select its actions.
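As a minimal sketch of what such a pre-specified policy can look like, the snippet below implements a greedy policy on top of a table of values. The table `q_values`, its shape (3 states by 2 actions), its numbers, and the helper name `greedy_policy` are illustrative assumptions, not part of the course material.

```python
import numpy as np

# Hypothetical value table: one row per state, one column per action.
# The numbers are made up purely for illustration.
q_values = np.array([
    [0.1, 0.5],   # state 0: action 1 looks best
    [0.7, 0.2],   # state 1: action 0 looks best
    [0.0, 0.0],   # state 2: tie, argmax returns the first action
])

def greedy_policy(q_table: np.ndarray, state: int) -> int:
    """Return the action with the highest estimated value in `state`."""
    return int(np.argmax(q_table[state]))

actions = [greedy_policy(q_values, s) for s in range(q_values.shape[0])]
print(actions)  # [1, 0, 0]
```

Nothing in `greedy_policy` is trained: all the learning would happen in the value table it reads from.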

- So the difference is:
    - In policy-based training, the optimal policy (denoted π*) is found by training the policy directly.
    - In value-based training, finding an optimal value function (denoted Q* or V*, we’ll study the difference below) leads to having an optimal policy.

<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" height = 600 width = 600>

- In value-based methods, we have 2 types of value functions:

- State-value function under a policy $\pi$
    - For each state, the state-value function outputs the expected return if the agent starts at that state and then follows the policy forever afterward (for all future timesteps, if you prefer).
    - If we take the state with value -7: it's the expected return starting at that state and taking actions according to our policy (greedy policy), so right, right, right, down, down, right, right.


<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/state-value-function-1.jpg" width = 400 height  = 400> <img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/state-value-function-2.jpg" height = 400 width = 400>

- Action-value function
    - For each state-action pair, the action-value function outputs the expected return if the agent starts in that state, takes that action, and then follows the policy forever after.

- The value of taking action $a$ in state $s$ under a policy $\pi$ is:

<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/action-state-value-function-1.jpg" height = 400 width = 400><img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/action-state-value-function-2.jpg" height = 400 width = 400>

- We see that the difference is:
    - For the state-value function, we calculate the value of a state $S_t$.
    - For the action-value function, we calculate the value of the state-action pair $(S_t, A_t)$, hence the value of taking that action at that state.

<img src = 'https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/two-types.jpg' width = 400 height = 400>

- In either case, whichever value function we choose (state-value or action-value function), the returned value is the expected return.
- However, the problem is that to calculate EACH value of a state or a state-action pair, we need to sum all the rewards an agent can get if it starts at that state.
- This can be a computationally expensive process, and that’s where the Bellman equation comes in to help us.

### Bellman Equation to simplify the value estimation

- From what we have learned so far, we know that to calculate $V(S_t)$, we need to calculate the return starting at that state and then following the policy forever after. (The policy in the following example is a Greedy Policy; for simplification, we don't discount the reward.)
- So to calculate $V(S_t)$, we need to calculate the **sum** of the expected rewards. Hence:
<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman2.jpg" width = 400 height = 400>


- Then to calculate $V(S_{t+1})$, we need to calculate the return starting at that state $S_{t+1}$
<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman3.jpg" width = 400 height = 400>

- So we're repeating the same computation for the values of different states, which becomes tedious if it has to be done for every state value or state-action value.
- To simplify this, we use the Bellman equation, a recursive equation that works like this: instead of starting from the beginning for each state and calculating the full return, we can consider the value of any state as:
    - The immediate reward $R_{t+1}$ + the discounted value of the state that follows ($\gamma * V(S_{t+1})$).
    - If we go back to our example, the value of State 1 is equal to the expected cumulative return if we start at that state.
    - To calculate the value of State 1 directly, we would sum the rewards the agent collects if it starts in State 1 and then follows the policy for all the time steps.
    - This is equivalent to $V(S_t) = R_{t+1} + \gamma * V(S_{t+1})$: the immediate reward plus the discounted value of the next state.
    - Likewise, $V(S_{t+1}) = R_{t+2} + \gamma * V(S_{t+2})$.

- In the interest of simplicity, here we don't discount, so $\gamma = 1$. But you'll study an example with $\gamma = 0.99$ in the Q-Learning section of this unit.

- To recap, the idea of the Bellman equation is that instead of calculating each value as the sum of the expected return, which is a long process, we calculate the value as the sum of immediate reward + the discounted value of the state that follows.
- Before going to the next section, think about the role of gamma in the Bellman equation. What happens if the value of gamma is very low (e.g. 0.1 or even 0)? What happens if the value is 1? What happens if the value is very high, such as a million?
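To make the recap above concrete, here is a small sketch comparing the two ways of computing state values along one fixed path: summing all future rewards from scratch for every state, versus one backward sweep that reuses the value of the next state. The reward list and function names are hypothetical, and the sketch assumes a deterministic path under the policy with a terminal state of value 0.

```python
# Hypothetical rewards collected along one fixed path under the policy:
# rewards[i] is the reward received when leaving state i.
rewards = [-1.0, -1.0, -1.0, -1.0]
gamma = 1.0  # no discounting, as in the example above

def value_by_full_sum(start):
    """Recompute the (discounted) return from scratch for one state."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[start:]))

def values_by_bellman():
    """Sweep backwards once, reusing V(next state) for each state."""
    values = [0.0] * (len(rewards) + 1)  # value of the terminal state is 0
    for s in reversed(range(len(rewards))):
        values[s] = rewards[s] + gamma * values[s + 1]
    return values[:-1]

direct = [value_by_full_sum(s) for s in range(len(rewards))]
recursive = values_by_bellman()
print(direct)     # [-4.0, -3.0, -2.0, -1.0]
print(recursive)  # [-4.0, -3.0, -2.0, -1.0]
```

Both approaches agree, but the recursive version touches each reward once instead of re-summing the tail of the path for every state.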

### Monte Carlo vs Temporal Difference Learning

- We know that an RL agent learns by interacting with its environment.
- The idea is that, given the experience and the reward it receives, the agent will update its value function or policy.
- There are 2 different strategies for training our value function or policy function.
    - Both of them use experience to solve the RL problem, i.e. samples of state, action, reward, and next state.
    - The 2 strategies are Monte Carlo and Temporal Difference:
        - Monte Carlo uses an entire episode of experience before learning.
        - Temporal Difference uses only a single step $(S_t, A_t, R_{t+1}, S_{t+1})$ to learn.

#### Monte Carlo: Learning at the end of the episode
    
- Monte Carlo waits until the end of the episode, calculates $G_t$ (return) and uses it as a target for updating $V(S_t)$
- So it requires a complete episode of interaction before updating our value function.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/monte-carlo-approach.jpg" height = 400 width = 400>

- We always start the episode at the same starting point.
- The agent takes actions using the policy. For instance, using an Epsilon Greedy Strategy, a policy that alternates between exploration (random actions) and exploitation.
- We get the reward and the next state.
- We terminate the episode if the cat eats the mouse or if the mouse moves > 10 steps.
- At the end of the episode, we have a list of State, Action, Reward, and Next State tuples. For instance: [[State tile 3 bottom, Go Left, +1, State tile 2 bottom], [State tile 2 bottom, Go Left, +0, State tile 1 bottom]…]
- The agent will sum the total rewards $G_t$ (to see how well it did).
- It will then update $V(S_t)$ based on the formula shown below.
- Then it starts a new game with this new knowledge.

<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-3.jpg" height = 400 width = 400>  <img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-3p.jpg" height = 400 width = 400>


- For instance, if we train a state-value function using Monte Carlo:

- We initialize our value function so that it returns 0 value for each state.
- Our learning rate (lr) is 0.1 and our discount rate is 1 (= no discount).
- Our mouse explores the environment and takes random actions.
- The mouse made more than 10 steps, so the episode ends.

- We have a list of (state, action, reward, next_state) tuples, and we need to calculate the return $G_t$:
- $G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots$ (for simplicity we don't discount the rewards)
- $G_t = 1+0+0+0+0+0+1+1+0+0$
- $G_t = 3$
<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-5p.jpg" width = 400 height = 400>
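Continuing this example, the Monte Carlo update $V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)]$ can be worked through in a few lines. This is a minimal sketch using the numbers above (learning rate 0.1, no discount, initial value 0); the variable names are illustrative.

```python
# Rewards collected during the episode in the example above.
episode_rewards = [1, 0, 0, 0, 0, 0, 1, 1, 0, 0]
learning_rate = 0.1
gamma = 1.0  # discount rate of 1 means no discounting

# Return from the starting state: discounted sum of all rewards.
g_t = sum(gamma ** k * r for k, r in enumerate(episode_rewards))

# Monte Carlo update: move V(S_0) a fraction of the way toward G_t.
v_s0 = 0.0  # the value function is initialized to 0
v_s0 = v_s0 + learning_rate * (g_t - v_s0)

print(g_t)  # 3.0
print(round(v_s0, 3))
```

With these numbers the new estimate of $V(S_0)$ is approximately 0.3, i.e. one tenth of the way from 0 toward the observed return of 3.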


- **Note: how many times a state's value gets updated depends on how many episodes we run (a hyperparameter) and on how often the agent visits that state.**

#### Temporal Difference Learning: learning at each step
- Temporal Difference, on the other hand, waits for only one interaction (one step) $S_{t+1}$ to form a TD target and update $V(S_t)$ using $R_{t+1}$ and $\gamma * V(S_{t+1})$.
- The idea with TD is to update $V(S_t)$ at each step.
- But because we didn't experience an entire episode, we don't have $G_t$ (the return). Instead, we estimate $G_t$ by adding $R_{t+1}$ (the reward received for the current action) and the discounted value of the next state.
- This is called bootstrapping, because TD bases its update in part on an existing estimate $V(S_{t+1})$ and not on a complete sample $G_t$.
- This method is called TD(0) or one-step TD (update the value function after any individual step).


<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-1.jpg" height = 400 width = 400> <img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-1p.jpg" height = 400 width = 400>

- For the cat-and-mouse example, we would have something like the following:
    - We initialize our value function so that it returns 0 value for each state.
    - Our learning rate (lr) is 0.1, and our discount rate is 1 (no discount).
    - Our mouse begins to explore the environment and takes a random action: going to the left.
    - It gets a reward $R_{t+1}$ since it eats a piece of cheese.
    
<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-2p.jpg" height = 400 width = 400>
<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-3.jpg" height = 400 width = 400>
<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-3p.jpg" height = 400 width = 400>
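The single TD(0) update in this example can be sketched as follows. The text above does not state the size of the cheese reward, so the value `reward = 1.0` is an assumption (it matches the +1 used in the earlier example tuple); the variable names are illustrative.

```python
learning_rate = 0.1
gamma = 1.0        # no discounting in this example
reward = 1.0       # assumed reward for eating the piece of cheese

v_s0 = 0.0         # V(S_0), initialized to 0
v_s1 = 0.0         # V(S_1), also still 0

# TD target: immediate reward plus discounted estimate of the next state.
td_target = reward + gamma * v_s1
td_error = td_target - v_s0

# TD(0) update, applied right after this single step.
v_s0 = v_s0 + learning_rate * td_error
print(v_s0)  # 0.1
```

Unlike the Monte Carlo case, the update happens immediately, using the current estimate of $V(S_{t+1})$ in place of the full return.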

- **Note: here as well, how many times a state's value gets updated depends on how many training steps we run (a hyperparameter) and on how often the agent visits that state.**

#### To summarize 

- First we decide which kind of function to train (a policy or a value function). Once that is decided, we ask how to train it, and that's where these 2 strategies come into the picture.
- With Monte Carlo, we update the value function from a complete episode, and so we use the actual discounted return of this episode.
- With Temporal Difference learning, we update the value function from a single step, and we replace $G_t$, which we don't know, with an estimated return called the TD target.
<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Summary.jpg" height = 400 width = 400>

### Introducing Q-Learning

What is Q-Learning?

- Q-Learning is an off-policy value-based method that uses a Temporal Difference approach to train its action-value function:
    - Off-policy: We'll see this at the end.
    - Value-based method: Finds the optimal policy indirectly by training a value or action-value function that tells us the value of each state or each state-action pair.
    - TD approach: Updates its action-value function at each step instead of at the end of the episode.

- Q-Learning is the algorithm we use to train our Q-function, an action-value function that determines the value of being at a particular state and taking a specific action at that state.
<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function.jpg" width = 400 height = 400>

- In Q-Learning, the Q stands for the quality (the value) of that action at that state.

Also, to recap, here is the difference between value and reward:

- The value of a state, or a state-action pair, is the expected cumulative reward our agent gets if it starts at this state (or state-action pair) and then acts according to its policy.
- The reward is the feedback the agent gets from the environment after performing an action at a state.

- Internally, our Q-function is encoded by a **Q-table, a table where each cell corresponds to a state-action pair value.** Think of this Q-table as **the memory or cheat sheet of our Q-function**.

- Let's take an example with this simple maze:
<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-1.jpg" height = 400 width = 400>

- The Q-table is initialized. That's why all the values are = 0. This table contains, for each state and action, the corresponding state-action values.
<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-2.jpg" height = 400 width = 400>

- Here we see that the state-action value of the initial state and going up is 0:
<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-3.jpg" height = 400 width = 400>

- So: the Q-function uses a Q-table that has the value of each state-action pair. Given a state and action, our Q-function will search inside its Q-table to output the value.
<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" height = 400 width = 400>
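A Q-table and its lookup can be sketched with a NumPy array, one row per state and one column per action. The sizes (6 states, 4 actions), the action ordering, and the helper name `q_function` are assumptions chosen for illustration.

```python
import numpy as np

n_states, n_actions = 6, 4                  # hypothetical maze: 6 cells, 4 moves
ACTIONS = ["left", "right", "up", "down"]   # assumed action ordering

# Q-table initialized to 0 for every state-action pair.
q_table = np.zeros((n_states, n_actions))

def q_function(state: int, action: str) -> float:
    """Look up the stored value of taking `action` in `state`."""
    return float(q_table[state, ACTIONS.index(action)])

print(q_function(0, "up"))  # 0.0
```

Before any training, every lookup returns 0, which matches the freshly initialized table in the maze example.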

If we recap, Q-Learning is the RL algorithm that:

- Trains a Q-function (an action-value function), which internally is a Q-table that contains all the state-action pair values.
- Given a state and action, our Q-function will search its Q-table for the corresponding value.
- When the training is done, we have an optimal Q-function, which means we have an optimal Q-table.
- And if we have an optimal Q-function, we have an optimal policy since we know the best action to take at each state.

<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" height = 400 width = 400>

- In the beginning, our Q-table is useless since it gives arbitrary values for each state-action pair (most of the time, we initialize the Q-table to 0). As the agent explores the environment and we update the Q-table, it will give us a better and better approximation to the optimal policy.

<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-1.jpg" width = 400 height = 400>

- Now that we understand what Q-Learning, Q-functions, and Q-tables are, let’s dive deeper into the Q-Learning algorithm.

#### The Q-Learning Algorithm

- This is the Q-Learning pseudocode; let’s study each part and see how it works with a simple example before implementing it. Don’t be intimidated by it, it’s simpler than it looks! We’ll go over each step.

<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" width = 400 height = 400>
- Step 1: We initialize the Q-table, **most of the time, we initialize with values of 0**

<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-3.jpg" width = 400 height = 400>

- Step 2: Choose an action using the epsilon-greedy strategy

<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-4.jpg" width = 400 height = 400>

- The idea is that, with an initial value of ɛ = 1.0:
    - With probability 1 - ɛ: we do exploitation (i.e. our agent selects the action with the highest state-action pair value).
    - With probability ɛ: we do exploration (trying a random action).

- At the beginning of the training, the probability of doing exploration will be huge since ɛ is very high, so most of the time, we’ll explore. But as the training goes on, and consequently our Q-table gets better and better in its estimations, we progressively reduce the epsilon value since we will need less and less exploration and more exploitation.

<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-5.jpg" width = 400 height = 400>
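The epsilon-greedy choice and a decaying epsilon can be sketched as below. The exponential decay schedule and its parameters (`max_eps`, `min_eps`, `decay_rate`) are one possible choice made for illustration; the text above only says that epsilon is progressively reduced.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def epsilon_greedy(q_table, state, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[1]))   # explore
    return int(np.argmax(q_table[state]))            # exploit

def decayed_epsilon(episode, max_eps=1.0, min_eps=0.05, decay_rate=0.005):
    """Exponential decay from max_eps toward min_eps (assumed schedule)."""
    return min_eps + (max_eps - min_eps) * np.exp(-decay_rate * episode)

q_table = np.zeros((3, 2))
q_table[0, 1] = 1.0  # make action 1 clearly the best in state 0

print(epsilon_greedy(q_table, state=0, epsilon=0.0))  # 1 (always exploits)
```

With `epsilon=0.0` the function reduces to the greedy policy; with `epsilon=1.0` it always picks a random action, which is the situation at the very start of training.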


- Step 3: Perform action $A_t$, get reward $R_{t+1}$ and the next state $S_{t+1}$

<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-6.jpg" width = 400 height = 400>

- Step 4: Update Q($S_t, A_t$)

Remember that in TD Learning, we update our policy or value function (depending on the RL method we choose) after one step of the interaction.

- To produce our TD target, we use the immediate reward $R_{t+1}$ plus the discounted value of the next state, computed by **finding the action** that **maximizes the current Q-function at the next state.** (We call this bootstrapping.)

<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-7.jpg" width = 400 height = 400>

- Therefore, our Q($S_t, A_t$) update formula goes like this:

<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-8.jpg" width = 400 height = 400>

- This means to update our $Q(S_t, A_t)$:
    - We need $S_t, A_t, R_{t+1}, S_{t+1}$
    - To update our Q-value at a given state-action pair, we use the TD target.
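The update described above can be sketched as a small function over a NumPy Q-table. The function name `q_learning_update` and the default values (learning rate 0.1, gamma 0.99) are assumptions for illustration.

```python
import numpy as np

def q_learning_update(q_table, state, action, reward, next_state,
                      learning_rate=0.1, gamma=0.99):
    """Apply one Q-Learning update in place and return the TD target."""
    # TD target: immediate reward + discounted best value at the next state.
    td_target = reward + gamma * np.max(q_table[next_state])
    td_error = td_target - q_table[state, action]
    q_table[state, action] += learning_rate * td_error
    return td_target

q_table = np.zeros((2, 2))
q_learning_update(q_table, state=0, action=1, reward=1.0, next_state=1)
print(q_table[0, 1])  # 0.1
```

Only the single cell for $(S_t, A_t)$ changes; every other entry of the table is left untouched by this step.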

How do we form the TD target?

- We obtain the reward $R_{t+1}$ after taking the action $A_t$.
- To get the best state-action pair value for the next state, we use a greedy policy to select the next best action. Note that this is not an epsilon-greedy policy: it always takes the action with the highest state-action value. There is no randomness involved here; we simply take $\max_a Q(S_{t+1}, a)$, the highest Q-value available at the next state, and use it in the TD target.
- Then, when the update of this Q-value is done, we move on to the new state $S_{t+1}$ (the state our action actually led to) and select our next action using an epsilon-greedy policy again.
- This is why we say that Q-Learning is an off-policy algorithm.

#### Off-policy vs On-policy
- The difference is subtle:
    - Off-policy: using a different policy for acting (inference) and updating (training).
        - For instance, with Q-Learning, the epsilon-greedy policy (acting policy), is different from the greedy policy that is used to select the best next-state action value to update our Q-value (updating policy).
        -  Each update can use data collected at any point during training, regardless of how the agent was choosing to explore the environment when the data was obtained. 

- Acting Policy:
<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-1.jpg" width = 600 height =600>

This is different from the policy we use during the training part:

<img src ="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-2.jpg" width = 300 height = 300>    

- On-policy: using the same policy for acting and updating.
    - For instance, with Sarsa, another value-based algorithm, the epsilon-greedy policy (not a greedy policy) selects the next state-action pair used in the update.
    - Each update only uses data collected while acting according to the most recent version of the policy.
    

<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-3.jpg" width = 600 height = 600>

<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-4.jpg" width = 600 height = 600>
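The difference between the two update targets can be seen side by side in a short sketch. The Q-values, the reward, and the explored `next_action` below are made-up numbers chosen so that the two targets differ.

```python
import numpy as np

gamma = 0.99
reward = 0.0
# Hypothetical Q-values at the next state S_{t+1} for two actions.
q_next = np.array([0.2, 0.8])

# Suppose the epsilon-greedy acting policy happened to explore and
# picked action 0 as the next action A_{t+1}.
next_action = 0

# Off-policy (Q-Learning): bootstrap from the greedy action's value.
q_learning_target = reward + gamma * np.max(q_next)

# On-policy (Sarsa): bootstrap from the action actually selected.
sarsa_target = reward + gamma * q_next[next_action]

print(q_learning_target > sarsa_target)  # True
```

When the acting policy happens to pick the greedy action, the two targets coincide; they differ only on exploratory steps like the one assumed here.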

#### Let's discuss this with an example 

<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-Example-2.jpg" width = 600 height = 600>
<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-1.jpg" height = 600 width = 600>
<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-2.jpg" height = 600 width = 600>
<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Example-1.jpg" height = 600 width = 600>
<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-3.jpg" height = 600 width = 600>
<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-4.jpg" height = 600 width = 600>
<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-5.jpg" height = 600 width = 600>
<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Example-4.jpg" width = 600 height = 600>
<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-6.jpg" width = 600 height = 600>
<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-7.jpg" width = 600 height = 600>
<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-8.jpg" width = 600 height = 600>

### Recap

#### Summary of Value Based Functions and Strategies to train the value functions

- We have two types of value-based functions:
    - State-value function: outputs the expected return if the agent starts at a given state and acts according to the policy forever after.
    - Action-value function: outputs the expected return if the agent starts in a given state, takes a given action at that state, and then acts according to the policy forever after.
    - In value-based methods, rather than learning the policy, we define the policy by hand and we learn a value function. If we have an optimal value function, we will have an optimal policy.

- There are two types of methods to learn a value function (or a policy):
    - With the Monte Carlo method, we update the value function from a complete episode, and so we use the actual discounted return of this episode.
    - With the TD Learning method, we update the value function from a step, replacing the unknown $G_t$ with an estimated return called the TD target.

<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/summary-learning-mtds.jpg" width =600 height = 600>

#### Q-Learning Recap

- Q-Learning is the RL algorithm that :
    - Trains a Q-function, an action-value function encoded, in internal memory, by a Q-table containing all the state-action pair values.
    - Given a state and action, our Q-function will search its Q-table for the corresponding value.
<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" width = 600 height = 600>    

- When the training is done, we have an optimal Q-function, or, equivalently, an optimal Q-table.
- And if we have an optimal Q-function, we have an optimal policy, since we know, for each state, the best action to take.
<img src ="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" width = 600 height =600>

- But, in the beginning, our Q-table is useless since it gives arbitrary values for each state-action pair (most of the time we initialize the Q-table to 0 values). But, as we explore the environment and update our Q-table it will give us a better and better approximation.
<img src ="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/q-learning.jpeg" width =600 height =600 >

- This is the Q-Learning pseudocode:
<img src ="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" height = 600  width = 600>
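Putting the four steps together, here is a minimal sketch of tabular Q-Learning on a toy problem. The corridor environment, the `step` and `train` helpers, the epsilon decay schedule, and all hyperparameter values are assumptions made for illustration; this is not the course's implementation.

```python
import numpy as np

# Hypothetical toy environment: a corridor of 5 cells. The agent starts in
# cell 0 and receives +1 for reaching cell 4, which ends the episode.
N_STATES, N_ACTIONS = 5, 2          # actions: 0 = left, 1 = right
GOAL = N_STATES - 1

def step(state, action):
    """Return (next_state, reward, done) for the corridor environment."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

def train(n_episodes=500, max_steps=50, learning_rate=0.1, gamma=0.9,
          max_eps=1.0, min_eps=0.05, decay_rate=0.01, seed=0):
    rng = np.random.default_rng(seed)
    q_table = np.zeros((N_STATES, N_ACTIONS))   # Step 1: initialize to 0
    for episode in range(n_episodes):
        # Assumed schedule: exponential decay of epsilon over episodes.
        epsilon = min_eps + (max_eps - min_eps) * np.exp(-decay_rate * episode)
        state = 0
        for _ in range(max_steps):
            # Step 2: epsilon-greedy action selection (acting policy).
            if rng.random() < epsilon:
                action = int(rng.integers(N_ACTIONS))
            else:
                action = int(np.argmax(q_table[state]))
            # Step 3: act, observe the reward and the next state.
            next_state, reward, done = step(state, action)
            # Step 4: update toward the greedy TD target (updating policy).
            td_target = reward + gamma * np.max(q_table[next_state])
            q_table[state, action] += learning_rate * (td_target - q_table[state, action])
            state = next_state
            if done:
                break
    return q_table

q_table = train()
greedy_actions = np.argmax(q_table[:GOAL], axis=1)
print(greedy_actions)
```

After training, the greedy action read from the Q-table should be "right" (action 1) in every non-terminal cell, which is the shortest path to the reward in this toy corridor.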

### Glossary

- **Strategies to find the optimal policy**
    - Policy-based methods. The policy is usually trained with a neural network to select what action to take given a state. In this case it is the neural network which outputs the action that the agent should take instead of using a value function. Depending on the experience received from the environment, the neural network will be re-adjusted and will provide better actions.
    - Value-based methods. In this case, a value function is trained to output the value of a state or a state-action pair that will represent our policy. However, this value doesn’t define what action the agent should take. In contrast, we need to specify the behavior of the agent given the output of the value function. For example, we could decide to adopt a policy to take the action that always leads to the biggest reward (Greedy Policy). In summary, the policy is a Greedy Policy (or whatever decision the user takes) that uses the values of the value-function to decide the actions to take.
    
- **Among the value-based methods, we can find two main strategies**
    - The state-value function. For each state, the state-value function is the expected return if the agent starts in that state and follows the policy until the end.
    - The action-value function. In contrast to the state-value function, the action-value calculates for each state and action pair the expected return if the agent starts in that state and takes an action. Then it follows the policy forever after.

- **Epsilon-greedy strategy:**
    - Common strategy used in reinforcement learning that involves balancing exploration and exploitation.
    - Chooses the action with the highest expected reward with a probability of 1-epsilon.
    - Chooses a random action with a probability of epsilon.
    - Epsilon is typically decreased over time to shift focus towards exploitation.

- **Greedy strategy:**
    - Involves always choosing the action that is expected to lead to the highest reward, based on the current knowledge of the environment. (Only exploitation)
    - Always chooses the action with the highest expected reward.
    - Does not include any exploration.
    - Can be disadvantageous in environments with uncertainty or unknown optimal actions.

- **Off-policy vs on-policy algorithms**
    - Off-policy algorithms: A different policy is used at training time and inference time
    - On-policy algorithms: The same policy is used during training and inference