---
title: "RL Unit 2: Introduction to Q Learning"
description: "Unit 2 Learnings from Hugging Face RL Course"
format:
    html:
        code-fold: true
render-on-save: true
execute:
    eval: false
    echo: true
jupyter: python3
output:
  quarto::html_document:
    self_contained: false
    keep_md: false

categories:
    - Re-inforcement Learning
    - Regression Project
image: ./images/RL2_QLearning.jpg
---


## Chapter 3: Deep Q-Learning

- Back in the previous Q-learning unit:
    - We implemented a Q-learning algorithm from scratch and trained it on the Taxi-v3 and FrozenLake-v1 environments.
    - We got excellent results with this simple algorithm, but these environments were relatively simple because the state space was discrete and small.
    - However, we also need to handle more complex problems, such as Atari games, which have $10^9$ to $10^{11}$ states.
    - With such a huge state space, producing and updating a Q-table becomes ineffective.
    - Thus we will use Deep Q-Learning, which uses a neural network that takes a state and approximates the Q-values for each action based on that state.
    - In this unit, we will train an agent to play Space Invaders and other Atari environments using RL-Zoo, a training framework for RL built on Stable-Baselines that provides scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.
    

### From Q-Learning to Deep Q-Learning

- Q-Learning is an algorithm we use to train our Q-function, an action-value function that determines the value of being at a particular state and taking a specific action at that state.
- The Q stands for the quality of that action at that state. Internally, the Q-function is encoded by a Q-table, a table where each cell corresponds to a state-action pair value. The Q-table serves as the memory of our Q-function.
- However, Q-learning is a tabular method. This is fine for a small state space, but it does not scale once the state space becomes large.
- For example, Atari environments have an observation space with a shape of (210, 160, 3) containing values from 0 to 255. This gives us $N = 256^{210 \times 160 \times 3}$ possible observations (for comparison, there are approximately $10^{80}$ atoms in the observable universe).
- So overall we would need a Q-table of size N by A, where A is the number of actions, and that is again huge.
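
To get a feel for how large N is, the arithmetic above can be checked with a few lines of Python. This is only a rough illustration: the variable names are made up for this note, and the code counts the decimal digits of N instead of computing N itself.

```python
import math

# Shape of a raw Atari frame: height x width x RGB channels (from the text above).
height, width, channels = 210, 160, 3
values_per_entry = 256  # each entry is an integer between 0 and 255

num_entries = height * width * channels  # numbers stored in one observation

# N = 256 ** num_entries is far too large to print, so count its decimal digits.
num_digits = math.floor(num_entries * math.log10(values_per_entry)) + 1

print(f"entries per observation: {num_entries}")
print(f"N has {num_digits} decimal digits, while 10^80 has only 81")
```

Even before multiplying by the number of actions, a table with one row per observation is clearly out of reach.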

<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari.jpg" width = 600 height = 600>

- Thus we can see that the state space is gigantic. Because of this, creating and updating a Q-table for that environment would not be efficient.
- In this case, the best idea is to approximate the Q-values using a *parameterized Q-function $Q_{\theta}(s,a)$*.
- We will use a neural network that, for a given state, approximates the Q-values of each possible action at that state.
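
To make the contrast with a Q-table concrete, here is a minimal sketch of a parameterized Q-function in plain Python. It uses a single linear layer instead of a deep network so that it runs without any deep learning library. The class name `LinearQFunction`, the state dimension and the number of actions are illustrative assumptions, not part of the course material.

```python
import random

class LinearQFunction:
    """Q_theta(s, .) as one linear layer: a hypothetical stand-in for a deep network."""

    def __init__(self, state_dim, num_actions, seed=0):
        rng = random.Random(seed)
        # theta consists of one weight vector and one bias per action.
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(state_dim)]
                        for _ in range(num_actions)]
        self.biases = [0.0] * num_actions

    def __call__(self, state):
        """Return one approximate Q-value per action for the given state."""
        return [sum(w * x for w, x in zip(row, state)) + b
                for row, b in zip(self.weights, self.biases)]

    def num_parameters(self):
        return sum(len(row) for row in self.weights) + len(self.biases)

q_function = LinearQFunction(state_dim=8, num_actions=4)
state = [0.5] * 8  # made-up state vector
q_values = q_function(state)
print(len(q_values), "Q-values from", q_function.num_parameters(), "parameters")
```

The number of parameters depends on the state dimension and the number of actions, not on how many distinct states exist, which is what lets this approach scale where a table cannot.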

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/deep.jpg" width = 600 height = 600>

### The Deep Q-Network (DQN)

- This is the architecture of our Deep Q-Learning network:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/deep-q-network.jpg" width = 600 height = 600>

- As input, the network takes a stack of 4 frames as the state and outputs a vector with one Q-value for each possible action at that state.
- Then, like with Q-Learning, we just need to use our epsilon-greedy policy to select which action to take.
- When the Neural Network is initialized, the Q-value estimation is terrible, but during training, our Deep Q-Network agent will associate a situation with the appropriate action and learn to play the game well.
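
The action selection step can be sketched as follows. This assumes the network has already produced a list of Q-values for the current state; the function name `epsilon_greedy` and the example numbers are hypothetical.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon explore (random action), otherwise exploit (best action)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Index of the largest Q-value, i.e. the greedy action.
    return max(range(len(q_values)), key=lambda action: q_values[action])

# Made-up network output for one state with 4 possible actions.
example_q_values = [0.1, 1.7, -0.3, 0.9]

greedy_action = epsilon_greedy(example_q_values, epsilon=0.0)   # never explores
random_action = epsilon_greedy(example_q_values, epsilon=1.0)   # always explores
print("greedy action:", greedy_action)
```

In practice epsilon usually starts high and is decayed during training, so the agent explores early and exploits more as its Q-value estimates improve.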

#### Preprocessing the input and temporal limitation

- We need to preprocess the input. This is an essential step, since we want to reduce the complexity of our state and thus the computation time needed for training.
- To achieve this, we reduce each frame to 84x84 and convert it to grayscale. We can do this since the colors in Atari environments don't add important information. This is a big improvement since we reduce our three color channels (RGB) to 1.
- We can also crop a part of the screen in some games if it doesn't contain any crucial information. Then we stack 4 frames together.
- This stacking is necessary because it helps us handle the problem of temporal limitation. A single frame gives us no idea about motion, but if we stack several frames we capture temporal information.
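
The preprocessing steps above can be sketched in plain Python. This is a simplified illustration, not what RL-Zoo does internally: frames are nested lists, resizing uses nearest-neighbor sampling, and the names `to_grayscale`, `resize_nearest`, `preprocess` and `FrameStack` are assumptions made for this note.

```python
from collections import deque

def to_grayscale(frame):
    """Convert an H x W frame of (r, g, b) pixels to H x W luminance values."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in frame]

def resize_nearest(image, out_h=84, out_w=84):
    """Downsample by picking the nearest source pixel for each target pixel."""
    in_h, in_w = len(image), len(image[0])
    return [[image[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]

def preprocess(frame):
    return resize_nearest(to_grayscale(frame))

class FrameStack:
    """Keep the last `num_frames` preprocessed frames as the agent's state."""

    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        processed = preprocess(first_frame)
        for _ in range(self.frames.maxlen):
            self.frames.append(processed)  # fill the stack with the first frame
        return list(self.frames)

    def step(self, frame):
        self.frames.append(preprocess(frame))  # the oldest frame drops out
        return list(self.frames)

# Two synthetic 210 x 160 RGB frames: one black, one white.
dark = [[(0, 0, 0)] * 160 for _ in range(210)]
bright = [[(255, 255, 255)] * 160 for _ in range(210)]

stack = FrameStack(num_frames=4)
state = stack.reset(dark)
state = stack.step(bright)
print(len(state), "frames of", len(state[0]), "x", len(state[0][0]))
```

After the call to `step`, the state holds three dark frames followed by one bright frame, so the change between frames (the "motion") is visible to the network.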

<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/preprocessing.jpg" width = 600 height = 600><img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/temporal-limitation.jpg" width = 600 height = 600>

- These stacked frames are processed by 3 convolutional layers. These layers allow us to capture and exploit spatial relationships in images. But also, because the frames are stacked together, we can exploit some temporal properties across those frames. Finally, we have a couple of fully connected layers that output a Q-value for each possible action at that state.
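
To see how the stacked input shrinks as it passes through the convolutional layers, the shapes can be computed by hand. The notes above do not give the layer sizes, so the filter counts, kernel sizes and strides below are assumptions taken from the commonly cited DQN architecture (Mnih et al., 2015), with no padding.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# (out_channels, kernel, stride) per layer: assumed values, not from this unit.
conv_layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

size, channels = 84, 4  # 4 stacked 84x84 grayscale frames
shapes = [(channels, size, size)]
for out_channels, kernel, stride in conv_layers:
    size = conv_output_size(size, kernel, stride)
    channels = out_channels
    shapes.append((channels, size, size))

flattened = channels * size * size  # input size of the first fully connected layer

for shape in shapes:
    print(shape)
print("flattened features:", flattened)
```

Under these assumptions the fully connected layers receive a few thousand features instead of the 100,800 raw numbers of a single unprocessed frame.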

<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/temporal-limitation-2.jpg" width = 600 height = 600><img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/deep-q-network.jpg" width = 600 height = 600>

- **So, given a state, Deep Q-Learning uses a neural network to approximate the different Q-values for each possible action at that state.**

### The Deep Q-Learning Algorithm

- Now we know Deep Q-Learning uses a deep neural network to approximate the different Q-values for each possible action at a state (value-function estimation)
- The main difference between Q-Learning and Deep Q-Learning is that during the **training phase, instead of updating the Q-value of a state-action pair directly** as we did with Q-Learning, we create a **loss function that compares our Q-value prediction with the Q-target and use gradient descent to update the weights of our Deep Q-Network so that it approximates our Q-values better**.
- *So we need both Q-targets and Q-value predictions.*

<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-5.jpg" height = 600 width = 600><img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/Q-target.jpg" height = 600 width = 600>

- The Deep Q-Learning training algorithm has 2 phases:
    - Sampling: we perform actions and store the observed experience tuples in a replay memory.
    - Training: we randomly select a small batch of tuples and learn from this batch using a gradient descent update step.
    
<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/sampling-training.jpg" width = 600 height = 600>

- This is not the only difference compared with Q-Learning. 
- Deep Q-Learning training **might suffer from instability,** mainly because of combining a non-linear Q-value function (Neural Network) and bootstrapping (when we update targets with existing estimates and not an actual complete return)

- To help us stabilize the training, we implement 3 different solutions:
    - Experience Replay to make more efficient use of experiences
    - Fixed Q-Target to stabilize the training
    - Double Deep Q-Learning, to handle the problem of the overestimation of Q-values
    
- Let's go through them!

#### Experience Replay to make more efficient use of experiences

- Why create a replay memory?
    - Experience Replay in Deep Q-Learning has 2 functions:
        - **1. Make more efficient use of the experiences during the training.** Usually, in online reinforcement learning, the agent interacts with the environment, gets experiences (state, action, reward, and next state), learns from them (updates the neural network), and discards them. This is not efficient.
            - Experience replay helps by using the experiences of the training more efficiently. We use a **replay buffer** that saves experience samples **that we can reuse during the training**
            - This allows the agent to learn from the **same experiences multiple times**.

        - **2. Avoid forgetting previous experiences and reduce the correlation between experiences.**
            - The problem we get if we give sequential samples of experiences to our neural network is that it **tends to forget the previous experiences as it gets new experiences.** For instance, if the agent is in the first level and then in the second, which is different, it can forget how to behave and play in the first level.
            
            - The solution is to create a Replay Buffer that stores experience tuples while interacting with the environment and then sample a small batch of tuples. This prevents **the network from only learning about what it has done immediately before**.

    - Experience replay also has other benefits. By randomly sampling the experiences, we remove correlation in the observation sequences and prevent action values from **oscillating or diverging catastrophically**.
    - In the Deep Q-Learning pseudocode, we initialize a replay memory buffer D with capacity N (N is a hyperparameter that you can define). We then store experiences in the memory and sample a batch of experiences to feed the Deep Q-Network during the training phase.
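
A replay memory with capacity N can be sketched with the standard library alone. The class name `ReplayBuffer`, the `Experience` fields and the toy transitions are assumptions made for illustration; libraries such as Stable-Baselines ship their own implementations.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-capacity memory D: the oldest experiences are dropped when it is full."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between consecutive steps.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(capacity=3)
for t in range(5):
    # Toy transitions: the state is just the time step.
    buffer.push(state=t, action=0, reward=1.0, next_state=t + 1, done=False)

batch = buffer.sample(batch_size=2)
print(len(buffer), "stored,", len(batch), "sampled")
```

Because sampling does not remove items, the same experience can appear in many batches, which is the reuse described above.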

<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/experience-replay-pseudocode.jpg" width = 600 height = 600>

#### Fixed Q-Target to stabilize the training

- When we want to calculate the TD error (aka the loss), we calculate the **difference between the TD target (Q-Target) and the current Q-value (estimation of Q)**
- But we don't have any **idea of the real TD target.** We need to estimate it. Using the Bellman equation, we saw that the TD target is just the reward of taking that action at that state plus the discounted highest Q-value for the next state.

- However, the problem is that we are using the same parameters (weights) for estimating the TD target and the Q-value. Consequently, there is a **significant correlation** between the TD target and the parameters we are changing. Therefore, **at every step of training, both our Q-values and the target values shift**. We're getting closer to our target, but the target is also moving. It's like chasing a moving target! This can lead to significant oscillation in training.

- For example:
    - Imagine you are a cowboy (the Q estimation) and you want to catch a cow (the Q-target). Your goal is to get closer (reduce the error).
    - At each time step, you try to approach the cow, which also moves at each time step (because you use the same parameters).
    - This leads to a bizarre chasing path (significant oscillation in training).
- Instead, what we see in the pseudocode is that we:
    - Use a separate network with fixed parameters for estimating the TD target.
    - Copy the parameters from our Deep Q-Network every C steps to update the target network.
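
The periodic copy can be illustrated with a toy loop. Here the "network" is just a dictionary holding a single integer weight and the gradient step is replaced by a placeholder increment; `TinyQNetwork`, the value of `C` and the number of steps are all hypothetical.

```python
import copy

class TinyQNetwork:
    """Stand-in for a neural network: a dict of named weights."""

    def __init__(self, weights):
        self.weights = dict(weights)

online_network = TinyQNetwork({"w": 0})
target_network = copy.deepcopy(online_network)  # starts as an exact copy

C = 3  # number of steps between target network updates (assumed value)
history = []

for step in range(1, 8):
    # Placeholder for one gradient descent step on the online network only.
    online_network.weights["w"] += 1

    # Every C steps, copy the online weights into the target network.
    if step % C == 0:
        target_network.weights = copy.deepcopy(online_network.weights)

    history.append((step, online_network.weights["w"], target_network.weights["w"]))

for row in history:
    print("step %d: online w=%d, target w=%d" % row)
```

Between two copies the target network stays fixed, so the TD target no longer shifts with every gradient step.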
        
<img src = "https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/fixed-q-target-pseudocode.jpg" width = 600 height = 600>
