---
title: "Introduction to RL HF"
description: "Unit 1 Learning from Hugging Face RL Course"
format:
    html:
        code-fold: true
render-on-save: true
execute:
    eval: false
    echo: true
jupyter: python3
output:
  quarto::html_document:
    self_contained: false
    keep_md: false

categories:
    - End To End Project
    - Regression Project
image: ./images/RL1_Introduction.jpg
---

# Hugging Face Reinforcement Learning course

## Chapter 1: INTRODUCTION TO DEEP REINFORCEMENT LEARNING

- Reinforcement learning is a framework for solving control tasks (also called decision problems) by building agents that learn from the environment by interacting with it through trial and error and receiving rewards (positive or negative) as unique feedback
- RL Process: Imagine an agent learning to play a platform game:
    - Our agent receives state $S_0$ from the environment - we receive the first frame of our game
    - Based on that state $S_0$, the Agent takes action $A_0$ - our agent will move to the right
    - The environment goes to a new state $S_1$ - new frame
    - The environment gives some reward $R_1$ to the agent - we're not dead (Positive Reward +1)
- This RL loop outputs a sequence of state, action, reward, and next state: $S_0, A_0, R_1, S_1$
- The agent's goal is to maximize its cumulative reward, called the expected return.
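
The loop above can be sketched in a few lines of Python. This is only an illustration: `ToyPlatformEnv`, its reward scheme (+1 for moving right, -1 otherwise), and `run_episode` are hypothetical names made up for these notes, not part of the course or of any library.

```python
class ToyPlatformEnv:
    """Hypothetical 1-D 'platform game': start at position 0, finish at position 5."""

    GOAL = 5

    def reset(self):
        self.position = 0
        return self.position  # initial state S_0

    def step(self, action):
        # action is +1 (move right) or -1 (move left); the position never drops below 0
        self.position = max(0, self.position + action)
        reward = 1.0 if action == 1 else -1.0  # assumed reward scheme
        done = self.position >= self.GOAL
        return self.position, reward, done


def run_episode(env, policy, max_steps=50):
    """Run the state -> action -> reward -> next state loop and record each step."""
    trajectory = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)                       # agent picks A_t from S_t
        next_state, reward, done = env.step(action)  # environment returns S_{t+1}, R_{t+1}
        trajectory.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return trajectory


def always_right(state):
    """Trivial policy: always move right, whatever the state."""
    return 1


trajectory = run_episode(ToyPlatformEnv(), always_right)
print(trajectory[0])  # the first (S_0, A_0, R_1, S_1) tuple
```

Each tuple in `trajectory` is one pass through the loop, so the list as a whole is the sequence of states, actions, rewards, and next states described above.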

### The reward hypothesis: the central idea of Reinforcement Learning

- RL is based on the reward hypothesis, which is that all goals can be described as the maximization of the expected return (expected cumulative reward).
- That’s why in Reinforcement Learning, to have the best behavior, we aim to learn to take actions that maximize the expected cumulative reward.

### Markov Property
- The Markov property implies that our agent needs only the current state to decide what action to take, not the history of all the states and actions it took before.

### Observations/States Space
- Observations/States are the information our agent gets from the environment. In the case of a video game, it can be a frame; in the case of a trading agent, it can be the value of a certain stock.
- There is a differentiation to make between observation and state, however:
    - State s: a complete description of the state of the world.
        - In a chess game, we have access to the whole board information, so we receive a state from the environment. In other words, the environment is fully observed.
    - Observation o: a partial description of the state.
        - In Super Mario Bros, we are in a partially observed environment. We receive an observation since we only see a part of the level.

### Action Space
- Action space is the set of all possible actions in an environment.
    - The actions can come from a discrete or continuous space:
        - Discrete space: the number of possible actions is finite. Ex. In Super Mario Bros, we have a finite set of actions since we have only 4 directions and jump.
        - Continuous space: the number of possible actions is infinite. Ex. A self-driving car agent has an infinite number of possible actions since it can turn left 20°, 21.1°, 21.2°, honk, turn right 20°.
    - *Taking this information into consideration is crucial because it will have importance when choosing the RL algorithm in the future.*
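
A minimal sketch of the difference, assuming a made-up list of five game actions and a made-up steering range of -30° to 30° (neither comes from the course): sampling from a discrete space picks one item from a finite list, while sampling from a continuous space draws a real number from an interval.

```python
import random

# Hypothetical discrete action set: 4 directions plus jump
DISCRETE_ACTIONS = ["left", "right", "up", "down", "jump"]


def sample_discrete_action(rng):
    """Pick one of a finite number of actions."""
    return rng.choice(DISCRETE_ACTIONS)


def sample_continuous_action(rng, low=-30.0, high=30.0):
    """Pick a steering angle in degrees; any real value in [low, high] is possible."""
    return rng.uniform(low, high)


rng = random.Random(0)  # seeded only so the sketch is repeatable
print(sample_discrete_action(rng))
print(sample_continuous_action(rng))
```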

### Rewards and the discounting
- The reward is fundamental in RL because it's the only feedback for the agent. Thanks to it, our agent knows whether the action taken was good or not.
    - The cumulative reward at each time step t equals the sum of all rewards in the sequence.

<img src="./images/RL_1_rewards_2.jpg" alt="Alt text" title="Optional title" width = 200 height = 300>

However, in reality, we can’t just add them like that. The rewards that come sooner (at the beginning of the game) are more likely to happen since they are more predictable than the long-term future reward.

<img src="./images/RL_1_rewards_3.jpg" alt="Alt text" title="Optional title" width = 600 height = 600>

- Let’s say your agent is this tiny mouse that can move one tile each time step, and your opponent is the cat (that can move too). The mouse’s goal is to eat the maximum amount of cheese before being eaten by the cat.
- As we can see in the diagram, it’s more probable to eat the cheese near us than the cheese close to the cat (the closer we are to the cat, the more dangerous it is).
- Consequently, the reward near the cat, even if it is bigger (more cheese), will be more discounted since we’re not really sure we’ll be able to eat it.

- To discount the rewards, we proceed like this:
    - 1. We define a discount rate called gamma. It must be between 0 and 1, and most of the time it is between 0.95 and 0.99.
        - The larger the gamma, the smaller the discount. This means our agent cares more about the long-term reward.
        - On the other hand, the smaller the gamma, the bigger the discount. This means our agent cares more about the short term reward (the nearest cheese).
    - 2. Then, each reward will be discounted by gamma to the exponent of the time step. As the time step increases, the cat gets closer to us, so the future reward is less and less likely to happen.

Our discounted expected cumulative reward is:

<img src="./images/RL_1_rewards_4.jpg" alt="Alt text" title="Optional title" width = 600 height = 600>
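
Assuming the usual form of the discounted return, $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$, the computation can be sketched as below. The function name `discounted_return` and the example reward lists are illustrative choices, not something defined in the course.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum the rewards, weighting the k-th future reward by gamma**k."""
    total = 0.0
    # Walking backwards lets us apply one multiplication by gamma per step:
    # G_t = R_{t+1} + gamma * G_{t+1}
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


cheese_nearby = [1, 0, 0, 0]    # reward arrives on the first step
cheese_far_away = [0, 0, 0, 1]  # same reward, three steps later

print(discounted_return(cheese_nearby, gamma=0.9))
print(discounted_return(cheese_far_away, gamma=0.9))
```

With `gamma=0.9` the cheese that is three steps further away is worth `0.9**3` of the nearby one, and lowering gamma shrinks it further, which is the short-term bias described above.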

### Tasks

- A task is an instance of a Reinforcement learning problem. We can have 2 types of tasks: episodic and continuing.
- Episodic task:
    - In this case, we have a starting point and an ending point (a terminal state). This creates an episode: a list of States, Actions, Rewards, and new States.
        - For instance, think about Super Mario Bros: an episode begins at the launch of a new Mario Level and ends when you’re killed or you reach the end of the level.
- Continuing task:
    - These are tasks that continue forever (no terminal state). In this case, the agent must learn how to choose the best actions and simultaneously interact with the environment.
        - For instance, an agent that does automated stock trading. For this task, there is neither a starting point nor a terminal state. The agent keeps running until we decide to stop it.



### Exploration vs Exploitation

- Exploration is exploring the environment by trying random actions in order to find more information about the environment.
- Exploitation is exploiting known information to maximize the reward.
- **Remember the goal of our RL agent is to maximize the expected cumulative reward. However, one can fall into the trap of exploiting the known rewards all the time.**

Example
- In this game, our mouse can have an infinite amount of small cheese (+1 each). But at the top of the maze, there is a gigantic sum of cheese (+1000).
    - However, if we only focus on exploitation, our agent will never reach the gigantic sum of cheese. Instead, it will only exploit the nearest source of rewards, even if this source is small (exploitation).
    - But if our agent does a little bit of exploration, it can discover the big reward (the pile of big cheese).
    
- If it’s still confusing, think of a real problem: the choice of picking a restaurant:
    - Exploitation: You go every day to the same restaurant that you know is good, and risk missing a better one.
    - Exploration: You try restaurants you have never been to before, with the risk of having a bad experience but the possibility of a fantastic one.

- This is what we call the exploration/exploitation trade-off. We need to balance how much we explore the environment and how much we exploit what we know about the environment.
    - Therefore, we must define a rule that helps to handle this trade-off. We’ll see the different ways to handle it in the future units.
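
One common rule for this trade-off is epsilon-greedy, shown here as a sketch under stated assumptions: these notes do not name a specific rule at this point, and the function name, the `epsilon` parameter, and the reward estimates are all illustrative. With probability `epsilon` the agent explores by picking a random action, otherwise it exploits the action with the highest estimated reward.

```python
import random


def epsilon_greedy(estimated_rewards, epsilon, rng=random):
    """Return an action index: random with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        # Explore: ignore what we know and try any action
        return rng.randrange(len(estimated_rewards))
    # Exploit: pick the action whose estimated reward is highest
    return max(range(len(estimated_rewards)), key=lambda a: estimated_rewards[a])


# Hypothetical estimates: the small cheese is known, the big pile has not been found yet
estimates = [1.0, 0.0, 0.0]
print(epsilon_greedy(estimates, epsilon=0.0))  # never explores, always picks action 0
print(epsilon_greedy(estimates, epsilon=0.1))  # occasionally tries another action
```

Setting `epsilon=0` reproduces the mouse that only ever eats the nearest small cheese, while any positive value gives it a chance to stumble on the big pile.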

### Two main approaches to solving the RL problems

- So far, we have looked at the RL framework:
    - RL process which consists of:
        - Observations or States Space
        - Action Space
        - Rewards and their discounting
        - Tasks, i.e. instances of a reinforcement learning problem: episodic or continuing
    - Reward Hypothesis: Every goal can be described as a maximization of the expected return.
    - Markov Property
    - Exploration vs Exploitation.
- We now have to see how this whole RL framework can be used to solve RL problems.
    - In other words, how do we build an RL agent that can select the actions that maximize its expected cumulative reward?

- Policy $\pi$
    - The brain of our agent, which defines its behaviour
    - Describes which action to take in which state
    - This is what we want to learn to solve the RL problem via the framework.
    - 2 Approaches to find the policy: Policy Based Methods and Value Based Methods
 

- Policy based methods:
    - Learn the policy function directly.
        - This function defines a mapping from each state either to the best corresponding action or to a probability distribution over the set of all the possible actions at that state.
        - There are 2 types of policy:
            - Deterministic policy: Given a state, this policy always returns the same action.
            - Stochastic policy: Outputs a probability distribution over all the actions in a given state.

<img src ="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/pbm_1.jpg" width = 600 height = 600>

<img src ="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/pbm_2.jpg" width = 600 height = 600>
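
The two kinds of policy can be contrasted in a small sketch. The states `"s0"` and `"s1"`, the action names, and the probabilities are all invented for illustration; in practice a policy is learned, not written by hand.

```python
import random

ACTIONS = ["left", "right", "jump"]


def deterministic_policy(state):
    """Same state in, same action out (hypothetical lookup table)."""
    table = {"s0": "right", "s1": "jump"}
    return table[state]


def stochastic_policy(state):
    """Return a probability distribution over ACTIONS for the given state."""
    distributions = {
        "s0": [0.1, 0.7, 0.2],
        "s1": [0.0, 0.5, 0.5],
    }
    return distributions[state]


def sample_action(state, rng=random):
    """Draw one action according to the stochastic policy's probabilities."""
    return rng.choices(ACTIONS, weights=stochastic_policy(state), k=1)[0]


print(deterministic_policy("s0"))  # always "right"
print(sample_action("s0"))         # usually "right", sometimes "left" or "jump"
```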

- Value based methods:
    - Indirect way of learning policy
    - Learn a value function, which maps a state to the expected value of being at that state.
        - Value of that state is the expected discounted return the agent can get if it starts in that state, and then acts according to our policy.
        - “Acting according to our policy” here just means that our policy is to go to the state with the highest value.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/value_1.jpg" width = 600 height = 600>

<img src ="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/vbm_1.jpg" width = 600 height = 600>

<img src ="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/vbm_2.jpg" width = 600 height = 600>
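
As a rough sketch of how a value function can define a policy, assume a tiny hand-written map of states, their neighbours, and their values. `STATE_VALUES` and `NEIGHBORS` are hypothetical; a real value function would be learned. At each step the policy moves to the neighbouring state with the highest value.

```python
# Hypothetical values: expected discounted return from each state
STATE_VALUES = {"A": -7, "B": -6, "C": -5, "goal": 0}

# Hypothetical layout: which states can be reached from which
NEIGHBORS = {"A": ["B"], "B": ["A", "C"], "C": ["B", "goal"], "goal": []}


def greedy_next_state(state):
    """Go to the reachable state with the highest value."""
    return max(NEIGHBORS[state], key=lambda s: STATE_VALUES[s])


def follow_greedy_policy(start, max_steps=10):
    """Repeatedly apply the greedy rule until no move is possible."""
    path = [start]
    state = start
    for _ in range(max_steps):
        if not NEIGHBORS[state]:  # terminal state: nothing left to choose
            break
        state = greedy_next_state(state)
        path.append(state)
    return path


print(follow_greedy_policy("A"))  # ['A', 'B', 'C', 'goal']
```

No policy is stored anywhere in this sketch: the behaviour falls out of the values, which is why the approach is described as an indirect way of learning a policy.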

### The Deep in Reinforcement learning

- Deep Reinforcement Learning introduces deep neural networks to solve Reinforcement Learning problems — hence the name “deep”.
- In the next unit, we’ll learn about two value-based algorithms: Q-Learning (classic Reinforcement Learning) and then Deep Q-Learning.
    - In the Q-learning approach, we use a traditional algorithm to create a Q table that helps us find what action to take for each state.
    - In the Deep Q-learning approach, we will use a Neural Network (to approximate the Q value).

<img src ="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/deep.jpg" width = 600 height = 600>
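
To make the idea of a Q table concrete, here is a minimal sketch with invented states, actions, and values. It shows only the lookup step (finding which action to take for a state); how the table is filled in is the subject of the next unit and is not shown. In Deep Q-learning, a neural network would stand in for the table and output approximate Q values.

```python
ACTIONS = ["left", "right"]

# Hypothetical Q table: one row per state, one value per action
Q_TABLE = {
    "s0": [0.1, 0.5],
    "s1": [0.7, 0.2],
}


def best_action(q_table, state):
    """Look up the state's row and return the action with the highest Q value."""
    values = q_table[state]
    best_index = max(range(len(values)), key=values.__getitem__)
    return ACTIONS[best_index]


print(best_action(Q_TABLE, "s0"))  # right
print(best_action(Q_TABLE, "s1"))  # left
```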

### Summary

- Reinforcement Learning is a computational approach to learning from actions. We build an agent that learns from the environment by interacting with it through trial and error and receiving rewards (negative or positive) as feedback.
- The goal of any RL agent is to maximize its expected cumulative reward (also called expected return) because RL is based on the reward hypothesis, which is that all goals can be described as the maximization of the expected cumulative reward.
- The RL process is a loop that outputs a sequence of state, action, reward and next state.
- To calculate the expected cumulative reward (expected return), we discount the rewards: the rewards that come sooner (at the beginning of the game) are more likely to happen since they are more predictable than the long-term future reward.
- To solve an RL problem, you want to find an optimal policy. The policy is the “brain” of your agent, which will tell us what action to take given a state. The optimal policy is the one which gives you the actions that maximize the expected return.

- There are two ways to find your optimal policy:
    - By training your policy directly: policy-based methods.
    - By training a value function that tells us the expected return the agent will get at each state and use this function to define our policy: value-based methods.
    
- Finally, we speak about Deep RL because we introduce deep neural networks to estimate the action to take (policy-based) or to estimate the value of a state (value-based), hence the name “deep”.

### Glossary

- Markov Property
    - It implies that the action taken by our agent is conditional solely on the present state and independent of the past states and actions.

- Observations/State
    - State: Complete description of the state of the world.
    - Observation: Partial description of the state of the environment/world.
- Actions
    - Discrete Actions: Finite number of actions, such as left, right, up, and down.
    - Continuous Actions: Infinite number of possible actions; for example, a self-driving car can take infinitely many different actions in a given driving scenario.

- Rewards and Discounting
    - Rewards: Fundamental factor in RL. Tells the agent whether the action taken is good/bad.
    - RL algorithms are focused on maximizing the cumulative reward.
    - Reward Hypothesis: RL problems can be formulated as a maximisation of (cumulative) return.
    - Discounting is performed because rewards obtained at the start are more likely to happen as they are more predictable than long-term rewards.

- Tasks
    - Episodic: Has a starting point and an ending point.
    - Continuing: Has a starting point but no ending point.

- Exploration vs Exploitation Trade-Off
    - Exploration: It’s all about exploring the environment by trying random actions and receiving feedback/returns/rewards from the environment.
    - Exploitation: It’s about exploiting what we know about the environment to gain maximum rewards.
    - Exploration-Exploitation Trade-Off: It balances how much we want to explore the environment and how much we want to exploit what we know about the environment.

- Policy
    - Policy: It is called the agent’s brain. It tells us what action to take, given the state.
    - Optimal Policy: Policy that maximizes the expected return when an agent acts according to it. It is learned through training.
    - Policy-based Methods:
        - An approach to solving RL problems.
        - In this method, the Policy is learned directly.
        - Maps each state either to the best corresponding action at that state or to a probability distribution over the set of possible actions at that state.
    - Value-based Methods:
        - Another approach to solving RL problems.
        - Here, instead of training a policy, we train a value function that maps each state to the expected value of being in that state.