# **Q-Learning with Frozen Lake ⛄ (non-slippery version)**

*Notebook adapted from examples by Thomas Simonini (Hugging Face)*

In this lab, we're going to train a Reinforcement Learning agent to play ***Frozen Lake***, using the Q-Learning algorithm.

### 👀 Game overview

<img src="https://media.licdn.com/dms/image/D5622AQGQvw8eta9usg/feedshare-shrink_2048_1536/0/1689777707403?e=1701907200&v=beta&t=-lfBdSVX2VieHmADRJ_flvyvJDkGrpRHJQr0JX1yW-w" alt="FrozenLake"/>

### 🎮 Environment:

- [FrozenLake-v1](https://gymnasium.farama.org/environments/toy_text/frozen_lake/)

### 📚 Libraries:

- Python and NumPy
- [Gymnasium](https://gymnasium.farama.org/): a library of sample environments for practicing Reinforcement Learning (a maintained fork of OpenAI Gym)


## Install dependencies

- `gymnasium`: Contains the FrozenLake-v1 ⛄ environment (and other cool environments to try!).
- `pygame`: Used for the FrozenLake-v1 UI.
- `numpy`: Used for handling our Q-table.

In [None]:
#environment:
!pip install gymnasium
!pip install pygame

#progress bar:
!pip install tqdm

#for video:
!pip install imageio
!pip install imageio_ffmpeg



## Import dependencies

In [None]:
import numpy as np
import gymnasium as gym
import os, random
from tqdm.notebook import tqdm
import imageio   #for video
from IPython.display import Video   #for video
from google.colab import files   #for video

## Create and understand the FrozenLake environment ⛄



*Docs available here ->* https://gymnasium.farama.org/environments/toy_text/frozen_lake/


We're going to train our Q-Learning agent **to navigate from the starting state (S) to the goal state (G) by walking only on frozen tiles (F) and avoiding holes (H)**.

We can have two sizes of environment:

- `map_name="4x4"`: a 4x4 grid version
- `map_name="8x8"`: an 8x8 grid version


The environment has two modes:

- `is_slippery=False`: The agent always moves **in the intended direction** due to the non-slippery nature of the frozen lake (deterministic).
- `is_slippery=True`: The agent **may not always move in the intended direction** due to the slippery nature of the frozen lake (stochastic).

For now let's keep it simple with the 4x4 map and non-slippery.
We add a parameter called `render_mode` that specifies how the environment should be visualised. In our case, because we **want to record a video of the environment at the end, we need to set `render_mode` to `"rgb_array"`**.

As [explained in the documentation](https://gymnasium.farama.org/api/env/#gymnasium.Env.render), "rgb_array" means: return a single frame representing the current state of the environment. A frame is a `np.ndarray` with shape `(x, y, 3)` representing RGB values for an x-by-y pixel image.

In [None]:
# Create the FrozenLake-v1 environment using chosen map and non-slippery version and render_mode="rgb_array"
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False, render_mode="rgb_array")

You can create your own custom grid like this:

```python
desc=["SFFF", "FHFH", "FFFH", "HFFG"]
gym.make('FrozenLake-v1', desc=desc, is_slippery=True)
```

but we'll use the default environment for now.
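
When writing a custom `desc`, it can help to check that the goal is reachable at all before training on it. The following is a minimal sketch of such a check using a breadth-first search over the grid; `goal_is_reachable` is a hypothetical helper written for this notebook, not part of Gymnasium, and it assumes a rectangular map with exactly one `S` tile.

```python
from collections import deque

def goal_is_reachable(desc):
    """Return True if 'G' can be reached from 'S' without stepping on 'H'."""
    nrows, ncols = len(desc), len(desc[0])
    # Locate the start tile (assumes exactly one 'S' on the map).
    start = next((r, c) for r in range(nrows) for c in range(ncols)
                 if desc[r][c] == "S")
    frontier, seen = deque([start]), {start}
    while frontier:
        r, c = frontier.popleft()
        if desc[r][c] == "G":
            return True
        # Same four moves as the environment: left, down, right, up.
        for dr, dc in ((0, -1), (1, 0), (0, 1), (-1, 0)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < nrows and 0 <= nc < ncols
            if inside and (nr, nc) not in seen and desc[nr][nc] != "H":
                seen.add((nr, nc))
                frontier.append((nr, nc))
    return False

desc = ["SFFF", "FHFH", "FFFH", "HFFG"]
print(goal_is_reachable(desc))
```

A map for which this returns `False` cannot be solved by any agent, however long it trains.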

### Let's see what the Environment looks like:


In [None]:
# We created our environment above with gym.make("<name_of_the_environment>"); now we inspect its observation space
print("_____OBSERVATION SPACE_____ \n")
print("Observation Space", env.observation_space)
print("Sample observation", env.observation_space.sample()) # Get a random observation

_____OBSERVATION SPACE_____ 

Observation Space Discrete(16)
Sample observation 9


We see from `Observation Space Discrete(16)` that the observation is an integer representing the **agent's current position as current_row * ncols + current_col (where both the row and col start at 0)**.

For example, the goal position in the 4x4 map can be calculated as follows: 3 * 4 + 3 = 15. The number of possible observations is dependent on the size of the map. **For example, the 4x4 map has 16 possible observations.**
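
The index formula above can be sketched in a few lines. The two helper names below are hypothetical (they are not Gymnasium functions), and `ncols=4` assumes the 4x4 map.

```python
def position_to_state(row, col, ncols=4):
    """Flatten a (row, col) grid position into the integer observation."""
    return row * ncols + col

def state_to_position(state, ncols=4):
    """Recover (row, col) from the integer observation."""
    return divmod(state, ncols)

print(position_to_state(3, 3))   # goal tile of the 4x4 map
print(state_to_position(9))      # the sample observation printed above
```

Going back and forth between the two views is handy when reading the Q-table, whose rows are indexed by the flattened state.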


For instance, this is what state = 0 looks like:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/frozenlake.png" alt="FrozenLake">

In [None]:
print("\n _____ACTION SPACE_____ \n")
print("Action Space Shape", env.action_space.n)
print("Action Space Sample", env.action_space.sample()) # Take a random action


 _____ACTION SPACE_____ 

Action Space Shape 4
Action Space Sample 3


The action space (the set of possible actions the agent can take) is discrete with 4 actions available:
- 0: GO LEFT
- 1: GO DOWN
- 2: GO RIGHT
- 3: GO UP

Reward function:
- Reach goal: +1
- Reach hole: 0
- Reach frozen: 0

## Create and Initialize the Q-table 🗄️

To know how many rows (states) and columns (actions) to use, we need to know the action and observation space. We already know their values from before, but we'll want to obtain them programmatically so that our algorithm generalizes to different environments. Gymnasium provides a way to do that: `env.action_space.n` and `env.observation_space.n`.


In [None]:
state_space = env.observation_space.n
print("There are ", state_space, " possible states")

action_space = env.action_space.n
print("There are ", action_space, " possible actions")

There are  16  possible states
There are  4  possible actions


In [None]:
# Create our Q-table of size (state_space, action_space) with every value initialized to 0 using np.zeros
def initialize_q_table(state_space, action_space):
  Qtable = np.zeros((state_space, action_space))
  return Qtable

In [None]:
Qtable_frozenlake = initialize_q_table(state_space, action_space)

## Define the updating policy

Remember we have two policies since Q-Learning is an **off-policy** algorithm. This means we're using a **different policy for acting and updating the value function**.

- Epsilon-greedy policy (acting policy)
- Greedy-policy (updating policy)

The greedy policy will also be the final policy we'll have when the Q-learning agent completes training. The greedy policy is used to select an action using the Q-table.


In [None]:
def greedy_policy(Qtable, state):
  # Exploitation: take the action with the highest state-action value
  action = np.argmax(Qtable[state][:])

  return action

## Define the acting policy

Epsilon-greedy is the training policy that handles the exploration/exploitation trade-off.

The idea with epsilon-greedy:

- With *probability 1 - ɛ*: **we do exploitation** (i.e. our agent selects the action with the highest state-action pair value).

- With *probability ɛ*: we do **exploration** (trying a random action).

As the training continues, we progressively **reduce the epsilon value since we will need less and less exploration and more exploitation.**
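
One detail worth keeping in mind: the exploration branch samples uniformly over all actions, so it can also land on the greedy action. The sketch below works out the resulting probability, assuming uniform sampling over `n_actions` and a single best action (no ties); `greedy_action_probability` is a hypothetical helper used only for illustration.

```python
def greedy_action_probability(epsilon, n_actions=4):
    """Probability that epsilon-greedy ends up taking the greedy action.

    Exploitation happens with probability (1 - epsilon); exploration happens
    with probability epsilon and picks the greedy action 1 time out of n_actions.
    """
    return (1.0 - epsilon) + epsilon / n_actions

for eps in (1.0, 0.5, 0.05):
    print(f"epsilon={eps}: P(greedy action) = {greedy_action_probability(eps):.4f}")
```

So even at ɛ = 1.0 the agent takes the currently best action a quarter of the time in a 4-action environment.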


In [None]:
def epsilon_greedy_policy(Qtable, state, epsilon):
  # Randomly generate a number between 0 and 1
  random_num = random.uniform(0,1)
  # if random_num is greater than epsilon --> exploitation
  if random_num > epsilon:
    # Take the action with the highest value given a state
    # np.argmax can be useful here
    action = greedy_policy(Qtable, state)
  # else --> exploration
  else:
    action = env.action_space.sample()

  return action

## Define the hyperparameters

The exploration-related hyperparameters are some of the most important ones.

- We need to make sure that our agent **explores enough of the state space** to learn a good value approximation. To do that, we need a progressive decay of epsilon.
- If you decrease epsilon too fast (a `decay_rate` that is too high), **you risk your agent getting stuck**, since it didn't explore enough of the state space and hence can't solve the problem.

In [None]:
# Training parameters
n_training_episodes = 10000  # Total training episodes
learning_rate = 0.7          # Learning rate

# Evaluation parameters
n_eval_episodes = 100        # Total number of test episodes

# Environment parameters
env_id = "FrozenLake-v1"     # Name of the environment
max_steps = 99               # Max steps per episode
gamma = 0.95                 # Discounting rate
eval_seed = []               # Per-episode evaluation seeds (empty list = no seeding)

# Exploration parameters
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.05            # Minimum exploration probability
decay_rate = 0.0005           # Exponential decay rate for exploration probability
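
To get a feel for these exploration settings, the sketch below evaluates the exponential decay formula used in the training loop at a few episode numbers. It copies the three exploration values from the cell above; `epsilon_at` is a hypothetical helper name, and the standard-library `math.exp` stands in for `np.exp`.

```python
import math

# Exploration parameters copied from the hyperparameters cell
max_epsilon = 1.0
min_epsilon = 0.05
decay_rate = 0.0005

def epsilon_at(episode):
    """Exploration probability used during the given (zero-based) episode."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)

for episode in (0, 1000, 5000, 9999):
    print(f"episode {episode:>5}: epsilon = {epsilon_at(episode):.3f}")
```

With this schedule epsilon approaches `min_epsilon` but never drops below it, so a small amount of exploration remains until the last training episode.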

## Create the training loop method

The training loop goes like this:

```
For episode in the total of training episodes:
  Reduce epsilon (since we need less and less exploration)
  Reset the environment

  For step in max timesteps:
    Choose the action (a) using the epsilon-greedy policy
    Take the action (a) and observe the outcome state (s') and reward (r)
    Update the Q-value: Q(s,a) <- Q(s,a) + lr * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    If done, finish the episode
    Our next state is the new state
```
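
As a worked example of the update step, the sketch below applies the rule by hand to the transition from state 14 to the goal (action RIGHT, reward 1), using the `learning_rate` and `gamma` values from the hyperparameters cell. `q_update` is a hypothetical helper; the goal is terminal and its Q-values stay at 0, so `max_q_next` is 0 here.

```python
learning_rate = 0.7   # from the hyperparameters cell
gamma = 0.95          # from the hyperparameters cell

def q_update(q_sa, reward, max_q_next):
    """One Q-learning update for a single (state, action) entry."""
    td_target = reward + gamma * max_q_next      # what the entry should move toward
    return q_sa + learning_rate * (td_target - q_sa)

q = 0.0                                           # Q-table starts at zero
for visit in range(1, 4):
    q = q_update(q, reward=1.0, max_q_next=0.0)
    print(f"after visit {visit}: Q(14, RIGHT) = {q:.3f}")
```

Each visit moves the entry 70% of the remaining distance toward the target of 1, which is why the trained table ends up with a value of 1 for this state-action pair.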

In [None]:
def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable, video_path, fps=8, max_video_mins=3):
  # Init video frames array:
  images = []
  max_frames = max_video_mins * 60 * fps   #max video length in frames
  # Init stats:
  episode_steps = []
  episode_rewards = []

  for episode in tqdm(range(n_training_episodes)):
    # Reduce epsilon (because we need less and less exploration)
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)
    # Reset the environment
    state, info = env.reset()
    step = 0
    total_rewards_ep = 0
    terminated = False
    truncated = False

    # Save beginning frame:
    if len(images) <= max_frames:
      images.append(env.render())

    # repeat
    for step in range(max_steps):
      # Choose the action At using epsilon greedy policy
      action = epsilon_greedy_policy(Qtable, state, epsilon)

      # Take the action (a) and observe the outcome state (s') and reward (r)
      new_state, reward, terminated, truncated, info = env.step(action)
      total_rewards_ep += reward

      # Update Q(s,a) := Q(s,a) + lr * [r + gamma * max_a' Q(s',a') - Q(s,a)]
      # (learning_rate and gamma are the globals from the hyperparameters cell)
      td_target = reward + gamma * np.max(Qtable[new_state])
      Qtable[state][action] += learning_rate * (td_target - Qtable[state][action])

      # Save updated frame:
      if len(images) <= max_frames:
        images.append(env.render())

      # If terminated or truncated finish the episode
      if terminated or truncated:
        break

      # Our next state is the new state
      state = new_state

    # END OF EPISODE: store episode results
    # (step is the zero-based index of the last step, so the episode took step + 1 actions)
    episode_steps.append(step)
    episode_rewards.append(total_rewards_ep)

  # END OF TRAINING: calculate performance:
  mean_reward = np.mean(episode_rewards)
  std_reward = np.std(episode_rewards)
  mean_steps = np.mean(episode_steps)
  std_steps = np.std(episode_steps)
  print("TRAINING RESULTS:")
  print(f"Mean reward: {mean_reward:.2f}, std: +/- {std_reward:.2f}")
  print(f"Mean steps: {mean_steps:.1f}, std: +/- {std_steps:.1f}")

  # Save training video:
  print("\nSaving output video...")
  imageio.mimsave(video_path, [np.array(img) for img in images], fps=fps)
  print("Video saved.")

  return Qtable

## Training

In [None]:
training_vpath = "./frozenlake_training.mp4"
Qtable_frozenlake = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_frozenlake, training_vpath)


TRAINING RESULTS:
Mean reward: 0.74, std: +/- 0.44
Mean steps: 5.5, std: +/- 2.4

Saving output video...
Video saved.


Play training video:

In [None]:
Video(training_vpath, embed=True)

## Final Q-table

In [None]:
Qtable_frozenlake

array([[0.73509189, 0.77378094, 0.77378094, 0.73509189],
       [0.73509189, 0.        , 0.81450625, 0.77378094],
       [0.77378094, 0.857375  , 0.77378094, 0.81450625],
       [0.81450625, 0.        , 0.77378094, 0.77378094],
       [0.77378094, 0.81450625, 0.        , 0.73509189],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.9025    , 0.        , 0.81450625],
       [0.        , 0.        , 0.        , 0.        ],
       [0.81450625, 0.        , 0.857375  , 0.77378094],
       [0.81450625, 0.9025    , 0.9025    , 0.        ],
       [0.857375  , 0.95      , 0.        , 0.857375  ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.9025    , 0.95      , 0.857375  ],
       [0.9025    , 0.95      , 1.        , 0.9025    ],
       [0.        , 0.        , 0.        , 0.        ]])
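
One way to read this table is to follow the greedy action from the start state and compare the values along the way with powers of `gamma`. The sketch below does that in plain Python. It assumes the Q-values printed above (copied here, rounded as shown), the default 4x4 map, and deterministic moves; `step` and `greedy_path` are hypothetical helpers, and `step` ignores holes because the greedy path never enters one.

```python
GAMMA = 0.95
NCOLS = 4
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}  # left, down, right, up

# Q-values copied from the printed table (one row per state)
Q = [
    [0.73509189, 0.77378094, 0.77378094, 0.73509189],
    [0.73509189, 0.0, 0.81450625, 0.77378094],
    [0.77378094, 0.857375, 0.77378094, 0.81450625],
    [0.81450625, 0.0, 0.77378094, 0.77378094],
    [0.77378094, 0.81450625, 0.0, 0.73509189],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9025, 0.0, 0.81450625],
    [0.0, 0.0, 0.0, 0.0],
    [0.81450625, 0.0, 0.857375, 0.77378094],
    [0.81450625, 0.9025, 0.9025, 0.0],
    [0.857375, 0.95, 0.0, 0.857375],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9025, 0.95, 0.857375],
    [0.9025, 0.95, 1.0, 0.9025],
    [0.0, 0.0, 0.0, 0.0],
]

def step(state, action):
    """Deterministic move on the grid; moving into a wall leaves the agent in place."""
    row, col = divmod(state, NCOLS)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), NCOLS - 1)
    col = min(max(col + d_col, 0), NCOLS - 1)
    return row * NCOLS + col

def greedy_path(start=0, goal=15, max_steps=99):
    """Follow the highest-valued action (first one on ties, like np.argmax)."""
    path, state = [start], start
    for _ in range(max_steps):
        if state == goal:
            break
        action = max(range(4), key=lambda a: Q[state][a])
        state = step(state, action)
        path.append(state)
    return path

path = greedy_path()
print("greedy path:", path)
# Best value in each visited state vs. gamma ** (moves left before the final one)
for moves_left, state in zip(range(len(path) - 1, 0, -1), path):
    print(state, max(Q[state]), round(GAMMA ** (moves_left - 1), 8))
```

The best value in each state on the path equals the reward of 1 discounted once for every move that precedes the final one, which is what the Q-learning update converges to in a deterministic environment.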

## Evaluation function

In [None]:
def evaluate_agent(env, max_steps, n_eval_episodes, Q, seed):
  """
  Evaluate the agent for ``n_eval_episodes`` episodes and return the mean and std of the episode rewards and of the episode steps.
  :param env: The evaluation environment
  :param max_steps: Maximum number of steps per episode
  :param n_eval_episodes: Number of episodes to evaluate the agent
  :param Q: The Q-table
  :param seed: List of per-episode evaluation seeds (used for Taxi-v3; an empty list means no seeding)
  """
  episode_steps = []
  episode_rewards = []
  for episode in tqdm(range(n_eval_episodes)):
    if seed:
      state, info = env.reset(seed=seed[episode])
    else:
      state, info = env.reset()
    step = 0
    truncated = False
    terminated = False
    total_rewards_ep = 0

    for step in range(max_steps):
      # Take the action (index) that has the maximum expected future reward given that state
      action = greedy_policy(Q, state)
      new_state, reward, terminated, truncated, info = env.step(action)
      total_rewards_ep += reward

      if terminated or truncated:
        break
      state = new_state

    # store episode results
    # (step is the zero-based index of the last step, so the episode took step + 1 actions)
    episode_steps.append(step)
    episode_rewards.append(total_rewards_ep)

  #calculate performance:
  mean_reward = np.mean(episode_rewards)
  std_reward = np.std(episode_rewards)
  mean_steps = np.mean(episode_steps)
  std_steps = np.std(episode_steps)

  return mean_reward, std_reward, mean_steps, std_steps

## Evaluate our Q-Learning agent

- In the non-slippery version, a trained agent should usually reach a mean reward of 1.0.
- The **environment is relatively easy** since the state space is really small (16). What you can try to do is [to replace it with the slippery version](https://gymnasium.farama.org/environments/toy_text/frozen_lake/), which introduces stochasticity, making the environment more complex.

In [None]:
# Evaluate our Agent
mean_reward, std_reward, mean_steps, std_steps = evaluate_agent(env, max_steps, n_eval_episodes, Qtable_frozenlake, eval_seed)

print("EVALUATION RESULTS:")
print(f"Mean reward: {mean_reward:.2f}, std: +/- {std_reward:.2f}")
print(f"Mean steps: {mean_steps:.1f}, std: +/- {std_steps:.1f}")


EVALUATION RESULTS:
Mean reward: 1.00, std: +/- 0.00
Mean steps: 5.0, std: +/- 0.0


## Record the final video

In [None]:
def record_video(env, Qtable, out_directory, fps=1):
  """
  Generate a replay video of the agent
  :param env: The environment, created with render_mode="rgb_array"
  :param Qtable: Qtable of our agent
  :param out_directory: Path of the output video file
  :param fps: how many frames per second (with FrozenLake-v1 we use 1)
  """
  images = []
  terminated = False
  truncated = False
  state, info = env.reset(seed=random.randint(0,500))
  img = env.render()
  images.append(img)
  # Stop as soon as the episode is terminated OR truncated
  while not (terminated or truncated):
    # Take the action (index) that has the maximum expected future reward given that state
    action = np.argmax(Qtable[state][:])
    state, reward, terminated, truncated, info = env.step(action) # We directly assign the next state to state for recording logic
    img = env.render()
    images.append(img)
  imageio.mimsave(out_directory, [np.array(img) for img in images], fps=fps)

In [None]:
# Record
final_vpath = "./frozenlake_final.mp4"
record_video(env, Qtable_frozenlake, final_vpath)
Video(final_vpath, embed=True)

Download videos:

In [None]:
files.download(training_vpath)
files.download(final_vpath)
