# Q-Learning

# 1 - What is Q-Learning?

Q-Learning is an off-policy value-based method that uses a TD approach to train its action-value function. **Q-Learning is the algorithm we use to train our Q-function**, an **action-value** function that determines the value of being at a particular state and taking a specific action at that state.

<table>
    <tr>
        <td><img src="images_2/Q-function.png" title="" alt="" width="500" data-align="center"></td>
    </tr>
</table>

Given a state and action, our Q-function outputs a state-action value (also called Q-value).

The **Q comes from "the Quality" (the value) of that action at that state**. Let's quickly recap the difference between value and reward:

* The value of a state, or of a state-action pair, is the expected cumulative reward our agent gets if it starts at this state (or state-action pair) and then acts according to its policy.
* The reward is the feedback the agent gets from the environment after performing an action at a state.
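
To make "cumulative reward" concrete, here is a minimal sketch that computes the discounted return of a single reward sequence. The discount factor `gamma` and the reward list are illustrative assumptions; the value of a state (or state-action pair) is the expectation of such returns over the episodes the policy can produce.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each discounted by gamma once per elapsed step."""
    total = 0.0
    # Iterate backwards so each step applies G_t = R_{t+1} + gamma * G_{t+1}
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Hypothetical episode: a small reward first, a big reward on the third step.
rewards = [1, 0, 10]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*10 = 3.5
```

A single reward is what the environment returns for one step; the return above is what the value estimates.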

**Internally, our Q-function has a Q-table, a table where each cell corresponds to a state-action pair value.** <span style="color:Blue"><b>Think of this Q-table as the memory or cheat-sheet of our Q-function.</b></span> Given a state and action, our Q-function looks up the corresponding value in its Q-table.

Consider the following "maze" and Q-table as an example:

<table>
    <tr>
        <td><img src="images_2/Maze-3.png" title="" alt="" width="500" data-align="center"></td>
    </tr>
</table>

If we recap, <span style="color:Blue">Q-Learning</span> is the RL algorithm that:

* Trains a Q-function (an **action-value function**), which internally is a **Q-table that contains all the state-action pair values**.
* Given a state and action, our Q-function will **look up the corresponding value in its Q-table**.
* **When the training is done, we have an optimal Q-function, which means we have an optimal Q-table**.
* And if we have an optimal Q-function, **we have an optimal policy, since we know the best action to take in each state**.
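
The last two points can be sketched in a few lines: the Q-function is a table lookup, and the policy is read off the table by taking the best action in each row. The table values and the names `q_value` and `greedy_action` are made up for illustration.

```python
# Toy Q-table: 3 states x 2 actions (values are invented for this example).
ACTIONS = ["left", "right"]
q_table = [
    [0.1, 0.5],  # state 0
    [0.7, 0.2],  # state 1
    [0.0, 0.0],  # state 2 (a tie resolves to the first action)
]

def q_value(state, action):
    """Q-function lookup: read one cell of the table."""
    return q_table[state][action]

def greedy_action(state):
    """Policy read off the table: index of the highest Q-value in the row."""
    row = q_table[state]
    return max(range(len(row)), key=row.__getitem__)

policy = [ACTIONS[greedy_action(s)] for s in range(len(q_table))]
print(policy)  # ['right', 'left', 'left']
```

If the table holds the optimal Q-values, this greedy read-out is an optimal policy.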

But, in the beginning, our Q-table is useless since it gives arbitrary values for each state-action pair (most of the time, we initialize the Q-table to 0). As the agent explores the environment and we update the Q-table, it will give us better and better approximations to the optimal policy:

<table>
    <tr>
        <td><img src="images_2/Q-learning-1.png" title="" alt="" width="500" data-align="center"></td>
    </tr>
</table>


# 2 - The Q-Learning algorithm

Now that we understand what Q-Learning, Q-function, and Q-table are, let’s dive deeper into the Q-Learning algorithm.

<table>
    <tr>
        <td><img src="images_2/Q-learning-2.png" title="" alt="" width="600" data-align="center"></td>
    </tr>
</table>

#### Step 1: Initialize the Q-table (usually with values of 0)

<table>
    <tr>
        <td><img src="images_2/Q-learning-3.png" title="" alt="" width="500" data-align="center"></td>
    </tr>
</table>

#### Step 2: Choose an action using epsilon-greedy strategy

<table>
    <tr>
        <td><img src="images_2/Q-learning-4.png" title="" alt="" width="500" data-align="center"></td>
    </tr>
</table>

The idea is that we define the initial epsilon $\epsilon = 1.0$:

* With probability $1-\epsilon$: we do **exploitation** (aka our agent selects the action with the highest state-action pair value).
* With probability $\epsilon$: we do **exploration** (aka our agent tries a random action).

At the beginning of the training, **the probability of doing exploration will be huge since $\epsilon$ is very high, so most of the time, we'll explore.** But as the training goes on, and consequently our **Q-table gets better and better in its estimations, we progressively reduce the epsilon value** since we will need less and less exploration and more exploitation.
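
One detail is worth spelling out. Assuming the random action is drawn uniformly from all $|\mathcal{A}|$ actions, the exploratory draw can also land on the greedy action, so the resulting selection probabilities are:

$$
\pi(a \mid s) =
\begin{cases}
1 - \epsilon + \dfrac{\epsilon}{|\mathcal{A}|} & \text{if } a = \arg\max_{a'} Q(s, a') \\
\dfrac{\epsilon}{|\mathcal{A}|} & \text{otherwise}
\end{cases}
$$

For example, with $\epsilon = 1.0$ and four actions, every action has probability $0.25$; with $\epsilon = 0.05$, the greedy action is chosen with probability $0.9625$.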

<table>
    <tr>
        <td><img src="images_2/Q-learning-5.png" title="" alt="" width="200" data-align="center"></td>
    </tr>
</table>

#### Step 3: Perform action $A_{t}$, get reward $R_{t+1}$ and next state $S_{t+1}$

<table>
    <tr>
        <td><img src="images_2/Q-learning-6.png" title="" alt="" width="500" data-align="center"></td>
    </tr>
</table>

#### Step 4: Update $Q(S_{t}, A_{t})$

Remember that in TD learning, we update our policy or value function (depending on the RL method we choose) **after one step of the interaction**. To produce our TD target, **we use the immediate reward $R_{t+1}$ plus the discounted value of the next best state-action pair** (this is called bootstrapping). Therefore, our $Q(S_{t}, A_{t})$ update formula goes like this:

<table>
    <tr>
        <td><img src="images_2/Q-learning-8.png" title="" alt="" width="550" data-align="center"></td>
    </tr>
</table>

This means that to update our $Q(S_{t}, A_{t})$:
* We need $S_{t}, A_{t}, R_{t+1}, S_{t+1}$
* To update our Q-value at a given state-action pair, we use the TD target

How do we form the TD target?

1. We obtain the reward $R_{t+1}$ after taking the action.
2. To get the **best next-state-action pair value**, we use a greedy policy to select the next best action. Note that this is not an epsilon-greedy policy: it always takes the action with the highest state-action value.
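
Written out, the two steps above give the following TD target and update, where $\alpha$ denotes the learning rate and $\gamma$ the discount factor (the standard notation for the Q-learning rule):

$$
\text{TD target} = R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a)
$$

$$
Q(S_{t}, A_{t}) \leftarrow Q(S_{t}, A_{t}) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_{t}, A_{t}) \right]
$$

The bracketed term, the TD target minus the current estimate, is the TD error.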

Then, once this Q-value has been updated, we start in a new state and select our action **using an epsilon-greedy policy again**. This is why we say **Q-learning is an off-policy algorithm**.

# 3 - Off-policy vs On-policy

* Off-policy. **Using a different policy for acting (inference) and updating (training)**. For instance, with Q-Learning, the epsilon-greedy policy (acting policy) is different from the greedy policy that is used to select the best next-state action value to update our Q-value (updating policy).

* On-policy. **Using the same policy for acting and updating**. For instance, with [Sarsa](https://en.wikipedia.org/wiki/State%E2%80%93action%E2%80%93reward%E2%80%93state%E2%80%93action), another value-based algorithm, the **epsilon-greedy policy selects the next state-action pair, not a greedy policy.**

<table>
    <tr>
        <td><img src="images_2/off-on-4.png" title="" alt="" width="600" data-align="center"></td>
    </tr>
</table>
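
In code, the difference comes down to which next-state value is used to bootstrap. The sketch below uses made-up Q-values and an illustrative `GAMMA`; the function names are hypothetical.

```python
GAMMA = 0.5  # discount factor (illustrative value)

def q_learning_target(reward, next_q_values):
    """Off-policy: bootstrap from the best next action (greedy)."""
    return reward + GAMMA * max(next_q_values)

def sarsa_target(reward, next_q_values, next_action):
    """On-policy: bootstrap from the action the agent will actually take."""
    return reward + GAMMA * next_q_values[next_action]

# Hypothetical Q-values of the next state for actions 0..3.
next_q = [0.0, 2.0, 1.0, 0.5]
# Suppose the epsilon-greedy policy explored and picked action 2.
explored_action = 2

print(q_learning_target(1.0, next_q))              # 1.0 + 0.5 * 2.0 = 2.0
print(sarsa_target(1.0, next_q, explored_action))  # 1.0 + 0.5 * 1.0 = 1.5
```

The two targets coincide whenever the acting policy happens to pick the greedy action; they differ only on exploratory steps.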

# 4 - A Q-Learning example

To better understand Q-learning, let's take a simple example:

<table>
    <tr>
        <td><img src="images_2/q-ex-1.jpg" title="" alt="" width="500" data-align="center"></td>
    </tr>
</table>


The reward function goes like this:
* +0: Going to a state with no cheese in it
* +1: Going to a state with a small cheese in it
* +10: Going to the state with the big pile of cheese
* -10: Going to the state with the poison and thus dying
* +0: If we spend more than five steps

To train our agent to have an optimal policy (that is, a policy that goes right, right, down), **we will use the Q-Learning algorithm**.

## 4.1 - Timestep 0

#### Step 1: Initialize the Q-table

<table>
    <tr>
        <td><img src="images_2/Example-1.png" title="" alt="" width="500" data-align="center"></td>
    </tr>
</table>


So, for now, **our Q-table is useless; we need to train our Q-function using the Q-Learning algorithm**.

Let’s do it for 2 training timesteps:

## 4.2 - Timestep 1

#### Step 2: Choose action using Epsilon Greedy Strategy

<table>
    <tr>
        <td><img src="images_2/q-ex-3.jpg" title="" alt="" width="500" data-align="center"></td>
    </tr>
</table>

Because epsilon is large (1.0), the agent takes a random action; in this case, it goes right.

#### Step 3: Perform action $A_{t}$, gets $R_{t+1}$ and $S_{t+1}$

<table>
    <tr>
        <td><img src="images_2/q-ex-4.jpg" title="" alt="" width="500" data-align="center"></td>
    </tr>
</table>

By going right, it gets a small cheese, so $R_{t+1} = 1$, and it is in a new state.

#### Step 4: Update $Q(S_{t}, A_{t})$

<table>
    <tr>
        <td><img src="images_2/q-ex-5.jpg" title="" alt="" width="500" data-align="center"></td>
    </tr>
</table>


<table>
    <tr>
        <td><img src="images_2/Example-4.png" title="" alt="" width="500" data-align="center"></td>
    </tr>
</table>
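
As a worked check of this update, assume a learning rate $\alpha = 0.1$ and a discount factor $\gamma = 0.99$ (illustrative values; the text does not state them). All Q-values are still 0, so $\max_{a} Q(S_{t+1}, a) = 0$:

$$
Q(\text{initial state}, \text{right}) \leftarrow 0 + 0.1 \left[ 1 + 0.99 \cdot 0 - 0 \right] = 0.1
$$

Under these assumptions, only the cell for (initial state, right) changes; every other cell stays at 0.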

## 4.3 - Timestep 2

#### Step 2: Choose action using Epsilon Greedy Strategy

Since epsilon is still large (0.99), it takes a random action again. It takes the "down" action, which is not good since it leads to the poison.

<table>
    <tr>
        <td><img src="images_2/q-ex-6.jpg" title="" alt="" width="500" data-align="center"></td>
    </tr>
</table>

#### Step 3: Perform action $A_{t}$, gets $R_{t+1}$ and $S_{t+1}$

<table>
    <tr>
        <td><img src="images_2/q-ex-7.jpg" title="" alt="" width="500" data-align="center"></td>
    </tr>
</table>

#### Step 4: Update $Q(S_{t}, A_{t})$

<table>
    <tr>
        <td><img src="images_2/q-ex-8.jpg" title="" alt="" width="500" data-align="center"></td>
    </tr>
</table>
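
As a worked check, assume a learning rate $\alpha = 0.1$ and a discount factor $\gamma = 0.99$ (illustrative values; the text does not state them). Reaching the poison gives $R_{t+1} = -10$, and the Q-values of the next state are all 0, so the bootstrap term vanishes:

$$
Q(S_{t}, \text{down}) \leftarrow 0 + 0.1 \left[ -10 + 0.99 \cdot 0 - 0 \right] = -1
$$

The negative value makes "down" the least attractive action in this state once the agent starts exploiting.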

**Since the mouse is dead, we start a new episode**. But we can see that with two exploration steps, the agent became smarter.

As we continue exploring and exploiting the environment and updating Q-values using the TD target, **the Q-table will give us better and better approximations. And thus, at the end of the training, we’ll get an estimate of the optimal Q-function**.

# 5 - Building a FrozenLake agent

We are going to train our agent with the Q-learning algorithm so that it learns to navigate the Frozen Lake environment.

* <a href="https://gymnasium.farama.org/environments/toy_text/frozen_lake/"><span style="color:blue"><b><i>FrozenLake</i></b> environment documentation</span></a>
* <a href="https://gymnasium.farama.org/tutorials/training_agents/FrozenLake_tuto/"><span style="color:blue"><b>Extended version of this example</b></span></a>

<table>
    <tr>
        <td><img src="images_2/frozen_lake.gif" title="" alt="" width="250" data-align="center"></td>
    </tr>
</table>

## 5.1 - Imports

We need to install a few libraries that let us run the environment, obtain information from it (image and score), and apply actions to it:


```
pip install gymnasium
pip install imageio
pip install imageio_ffmpeg
pip install "gymnasium[toy-text]"
```

In [6]:
import os
import random

import gymnasium as gym
import imageio
import numpy as np
# Import only the notebook progress bar; a plain `import tqdm` would be shadowed by it
from tqdm.notebook import tqdm

## 5.2 - Description

The game starts with the player at location [0,0] of the frozen lake grid world, with the goal located at the far extent of the world (e.g., [3,3] for the 4x4 environment).

* Holes in the ice are distributed in set locations when using a pre-determined map or in random locations when a random map is generated.

* The player makes moves until they reach the goal or fall in a hole.

* The lake is slippery (unless disabled), so the player may sometimes move perpendicular to the intended direction (see `is_slippery`).

* Randomly generated worlds will always have a path to the goal.

### 5.2.1 - Environment

The environment has 4 types of states:
* **S**: the initial state.
* **G**: the target state.
* **F**: states through which you can walk.
* **H**: "hole" states that should be avoided.

The size of the map is given by its name:
* `map_name="4x4"`: a map of size 4x4.
* `map_name="8x8"`: a map of size 8x8.

The environment has two modes:
* `is_slippery=False`. The agent always moves in the desired direction (deterministic).
* `is_slippery=True`. The agent does not always move in the desired direction due to the slippery nature of the ground (stochastic).

### 5.2.2 - Action space

Each action is a single integer (action shape `(1,)`) from 0 to 3, indicating which direction to move the player.

* 0: Move left
* 1: Move down
* 2: Move right
* 3: Move up
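
This encoding also makes the slippery mode easy to sketch. The code below assumes the default behaviour documented for Gymnasium's FrozenLake, where the intended direction and each of the two perpendicular directions are equally likely (probability 1/3 each). `slip_outcomes` and `sample_actual_move` are hypothetical helpers written for this illustration, not part of the library.

```python
import random

ACTION_NAMES = {0: "left", 1: "down", 2: "right", 3: "up"}

def slip_outcomes(action):
    """Directions the agent may actually move in when the ice is slippery.

    With this encoding, (action - 1) % 4 and (action + 1) % 4 are the two
    directions perpendicular to `action`.
    """
    return [(action - 1) % 4, action, (action + 1) % 4]

def sample_actual_move(action, rng):
    """Pick one of the three outcomes uniformly at random."""
    return rng.choice(slip_outcomes(action))

rng = random.Random(0)  # seeded so repeated runs give the same sample
for intended in range(4):
    names = [ACTION_NAMES[a] for a in slip_outcomes(intended)]
    print(f"intended {ACTION_NAMES[intended]:>5} -> may move {names}")
print("sampled move:", ACTION_NAMES[sample_actual_move(2, rng)])
```

Note that the agent never slips backwards: the direction opposite to the intended one is not among the outcomes.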


### 5.2.3 - Rewards

Reward schedule:

* Reach **goal** (**G**): +1
* Reach **hole** (**H**): 0
* Reach **frozen** (**F**): 0


### 5.2.4 - Episode End

The episode ends if the following happens:

* **Termination**:
  * The player moves into a hole.
  * The player reaches the goal at max(nrow) * max(ncol) - 1 (location [max(nrow)-1, max(ncol)-1]).
<br></br>
* **Time limit** (when using the `time_limit` wrapper):
  * The length of the episode is 100 for 4x4 environment, 200 for 8x8 environment.


## 5.3 - Initialize the environment

In [7]:
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)

This loads the default map. We could also define our own map as follows:

In [8]:
desc=["SFFF", "FHFH", "FFFH", "HFFG"]
gym.make('FrozenLake-v1', desc=desc, is_slippery=True)

<TimeLimit<OrderEnforcing<PassiveEnvChecker<FrozenLakeEnv<FrozenLake-v1>>>>>

In [9]:
print("_____OBSERVATION SPACE_____ \n")
print("Observation Space", env.observation_space)
print("Sample observation", env.observation_space.sample())

_____OBSERVATION SPACE_____ 

Observation Space Discrete(16)
Sample observation 10
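
Each observation is a single integer. In Gymnasium's FrozenLake the state index is computed as `row * ncols + col`, so it can be converted back to grid coordinates with `divmod`. The helper names below are hypothetical, and `DESC` repeats the 4x4 layout `desc` defined above.

```python
DESC = ["SFFF", "FHFH", "FFFH", "HFFG"]  # same layout as `desc` above
N_COLS = len(DESC[0])

def state_to_cell(state):
    """Convert a Discrete(16) observation to (row, col)."""
    return divmod(state, N_COLS)

def cell_to_state(row, col):
    """Inverse mapping: (row, col) back to the integer observation."""
    return row * N_COLS + col

def tile_at(state):
    """Letter of the map tile (S, F, H or G) for a given state."""
    row, col = state_to_cell(state)
    return DESC[row][col]

print(state_to_cell(10), tile_at(10))    # (2, 2) F
print(cell_to_state(3, 3), tile_at(15))  # 15 G
```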


## 5.4 - Initialize Q-table

In [10]:
state_space = env.observation_space.n
print("There are ", state_space, " possible states")

action_space = env.action_space.n
print("There are ", action_space, " possible actions")

# Create the Q-table of size (state_space, action_space), with every value initialized to 0 using np.zeros
def initialize_q_table(state_space, action_space):
    Qtable = np.zeros((state_space, action_space))
    return Qtable

Qtable_frozenlake = initialize_q_table(state_space, action_space)
Qtable_frozenlake

There are  16  possible states
There are  4  possible actions


array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

## 5.5 - Define policies

Remember that in Q-learning we use different policies to **act** and to **update our value function**:
* **Act**: $\epsilon$-greedy policy 
* **Update**: greedy policy

In [11]:
def greedy_policy(Qtable, state):
  # Exploitation: take the action with the highest state-action value
  action = np.argmax(Qtable[state][:])
  
  return action

def epsilon_greedy_policy(Qtable, state, epsilon):
  # Draw a random number uniformly from [0, 1]
  random_num = random.uniform(0, 1)
  # If random_num is greater than epsilon --> exploitation
  if random_num > epsilon:
    # Take the action with the highest value given a state
    action = greedy_policy(Qtable, state)
  # else --> exploration
  else:
    action = env.action_space.sample()
  
  return action

## 5.6 - Define hyperparameters

The exploration-related hyperparameters are some of the most important ones.

- We need to make sure that our agent **explores enough of the state space** to learn a good value approximation. To do that, we need to have progressive decay of the epsilon.
- If you decrease epsilon too fast (too high decay_rate), **you take the risk that your agent will be stuck**, since your agent didn't explore enough of the state space and hence can't solve the problem.

In [12]:
# Training parameters
n_training_episodes = 10000  # Total training episodes
learning_rate = 0.7          # Learning rate

# Evaluation parameters
n_eval_episodes = 100        # Total number of test episodes

# Environment parameters
env_id = "FrozenLake-v1"     # Name of the environment
max_steps = 99               # Max steps per episode
gamma = 0.95                 # Discounting rate
eval_seed = []               # Evaluation seeds, one per episode (empty list: no seeding)

# Exploration parameters
max_epsilon = 1.0            # Exploration probability at start
min_epsilon = 0.05           # Minimum exploration probability
decay_rate = 0.0005          # Exponential decay rate for exploration prob
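
To see what these exploration settings imply, the sketch below evaluates the exponential decay schedule used by the training loop in this notebook, `min_epsilon + (max_epsilon - min_epsilon) * exp(-decay_rate * episode)`, at a few episode indices. The helper name `epsilon_at` is hypothetical.

```python
import math

# Values copied from the exploration parameters above.
max_epsilon = 1.0
min_epsilon = 0.05
decay_rate = 0.0005

def epsilon_at(episode):
    """Exponentially decayed exploration probability for a given episode."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)

for episode in (0, 1000, 5000, 9999):
    print(f"episode {episode:>5}: epsilon = {epsilon_at(episode):.3f}")
```

With these values, epsilon is still about 0.63 after 1,000 episodes and about 0.13 after 5,000, so the agent keeps exploring for a large part of training. A larger `decay_rate` shrinks that window.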

## 5.7 - Training

In [13]:
def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
  for episode in tqdm(range(n_training_episodes)):
    # Reduce epsilon (because we need less and less exploration)
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)
    # Reset the environment
    state, info = env.reset()

    for step in range(max_steps):
      # Choose the action At using the epsilon-greedy policy
      action = epsilon_greedy_policy(Qtable, state, epsilon)

      # Take action At and observe Rt+1 and St+1
      new_state, reward, terminated, truncated, info = env.step(action)

      # Update Q(s,a) := Q(s,a) + lr * [R(s,a) + gamma * max_a' Q(s',a') - Q(s,a)]
      # (learning_rate and gamma are the globals defined in the hyperparameters cell)
      td_target = reward + gamma * np.max(Qtable[new_state])
      Qtable[state][action] += learning_rate * (td_target - Qtable[state][action])

      # If terminated or truncated finish the episode
      if terminated or truncated:
        break
      
      # Our next state is the new state
      state = new_state
  return Qtable

In [14]:
Qtable_frozenlake = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_frozenlake)


In [15]:
Qtable_frozenlake

array([[0.73509189, 0.77378094, 0.77378094, 0.73509189],
       [0.73509189, 0.        , 0.81450625, 0.77378094],
       [0.77378094, 0.857375  , 0.77378094, 0.81450625],
       [0.81450625, 0.        , 0.77378094, 0.77378094],
       [0.77378094, 0.81450625, 0.        , 0.73509189],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.9025    , 0.        , 0.81450625],
       [0.        , 0.        , 0.        , 0.        ],
       [0.81450625, 0.        , 0.857375  , 0.77378094],
       [0.81450625, 0.9025    , 0.9025    , 0.        ],
       [0.857375  , 0.95      , 0.        , 0.857375  ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.9025    , 0.95      , 0.857375  ],
       [0.9025    , 0.95      , 1.        , 0.9025    ],
       [0.        , 0.        , 0.        , 0.        ]])
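
The learned values have a simple structure in this deterministic setting: the only reward is +1 at the goal, so the best Q-value in a state that is k moves away from the goal is gamma^(k-1). The sketch below checks this against the table printed above and follows the greedy policy from the start state. `step_deterministic` is a hypothetical stand-in for the non-slippery dynamics, written only for this illustration.

```python
GAMMA = 0.95
N = 4  # 4x4 grid
# (row delta, col delta) for actions 0=left, 1=down, 2=right, 3=up
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}

# Rows of the trained Q-table printed above that lie on the greedy path.
Q = {
    0: [0.73509189, 0.77378094, 0.77378094, 0.73509189],
    4: [0.77378094, 0.81450625, 0.0, 0.73509189],
    8: [0.81450625, 0.0, 0.857375, 0.77378094],
    9: [0.81450625, 0.9025, 0.9025, 0.0],
    13: [0.0, 0.9025, 0.95, 0.857375],
    14: [0.9025, 0.95, 1.0, 0.9025],
}

def step_deterministic(state, action):
    """Move on the grid, staying in place when hitting a wall."""
    row, col = divmod(state, N)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), N - 1)
    col = min(max(col + d_col, 0), N - 1)
    return row * N + col

def greedy(state):
    """First action with the highest Q-value (same tie-break as np.argmax)."""
    row = Q[state]
    return max(range(len(row)), key=row.__getitem__)

path, state = [0], 0
while state != 15:  # 15 is the goal state
    state = step_deterministic(state, greedy(state))
    path.append(state)

print(path)                   # [0, 4, 8, 9, 13, 14, 15]
print(max(Q[0]), GAMMA ** 5)  # both are about 0.7738
```

The greedy path reaches the goal in six moves, which matches the best start-state value of gamma^5.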

## 5.8 - Evaluation

We run a simple evaluation to verify that the agent has learned to reach the target state in the environment we prepared for it.

In [16]:
def evaluate_agent(env, max_steps, n_eval_episodes, Q, seed):
    
    episode_rewards = []
    for episode in tqdm(range(n_eval_episodes)):
        if seed:
            state, info = env.reset(seed=seed[episode])
        else:
            state, info = env.reset()
        step = 0
        truncated = False
        terminated = False
        total_rewards_ep = 0

        for step in range(max_steps):
            # Take the action (index) that has the maximum expected future reward given that state
            action = greedy_policy(Q, state)
            new_state, reward, terminated, truncated, info = env.step(action)
            total_rewards_ep += reward

            if terminated or truncated:
                break
            state = new_state
        episode_rewards.append(total_rewards_ep)
        
    mean_reward = np.mean(episode_rewards)
    std_reward = np.std(episode_rewards)

    return mean_reward, std_reward

In [17]:
# Evaluate our agent over 100 test episodes
mean_reward, std_reward = evaluate_agent(env, max_steps, n_eval_episodes, Qtable_frozenlake, eval_seed)
print(f"Mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")


Mean_reward=1.00 +/- 0.00
