This notebook is inspired by the [StableBaselines3 RL Colab Notebooks](https://github.com/Stable-Baselines-Team/rl-colab-notebooks).

# Part 1: Stable Baselines 3

__[Stable Baselines3 (SB3)](https://stable-baselines3.readthedocs.io/en/master/)__ is a set of reliable implementations of reinforcement learning algorithms in PyTorch.

In this notebook, you will learn the basics of using the Stable Baselines3 library.

In [1]:
# Install SWIG, a tool used to connect C/C++ code with Python. It's often used in RL environments to enable efficient communication between Python and low-level implementations of algorithms
!pip install -q swig

# Install the gym library with the box2d environment, used for 2D physics-based simulation tasks
!pip install -q gym[box2d]

# Install stable-baselines3 with extra dependencies (needed for various environments and features in the library), a set of RL algorithms implemented in PyTorch
!pip install stable-baselines3[extra]

## Imports

Stable Baselines3 works on environments that follow the __[gym interface](https://stable-baselines.readthedocs.io/en/master/guide/custom_env.html)__.
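As a rough illustration of what "following the gym interface" means, the sketch below defines a tiny stand-in environment in plain Python. `CountdownEnv` is a hypothetical class invented for this example and does not inherit from `gymnasium.Env`; it only mimics the two methods an agent relies on, assuming the Gymnasium convention where `reset()` returns `(obs, info)` and `step(action)` returns `(obs, reward, terminated, truncated, info)`.

```python
class CountdownEnv:
    """Hypothetical toy environment: a counter that must be driven down to zero."""

    def __init__(self, start=3):
        self.start = start
        self.state = start

    def reset(self, seed=None, options=None):
        # Return the initial observation and an (empty) info dictionary
        self.state = self.start
        return self.state, {}

    def step(self, action):
        # Action 1 decrements the counter, any other action leaves it unchanged
        if action == 1:
            self.state -= 1
        terminated = self.state == 0          # the task itself is finished
        truncated = False                     # no time limit in this toy example
        reward = 1.0 if terminated else -0.1  # small step penalty, bonus at the end
        return self.state, reward, terminated, truncated, {}


env = CountdownEnv(start=3)
obs, info = env.reset()
total_reward = 0.0
done = False
while not done:
    obs, reward, terminated, truncated, info = env.step(1)  # always decrement
    total_reward += reward
    done = terminated or truncated

print(obs, round(total_reward, 2))
```

Real environments add `observation_space` and `action_space` attributes on top of these two methods, which is how a library can size its networks automatically.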


In [2]:
# Import the gymnasium library as gym, which provides various environments for developing and testing RL algorithms
import gymnasium as gym
import numpy as np

The first thing you need to import is the RL model (in our case __[PPO](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html)__).

In [11]:
# Import PPO (Proximal Policy Optimization) algorithm from stable-baselines3. PPO is a policy-gradient method and DQN is a value-based method
from stable_baselines3 import PPO

The next thing you need to import is the __[policy class](https://stable-baselines.readthedocs.io/en/master/modules/policies.html)__ that will be used to create the networks (for the policy/value functions).

In [12]:
# Import the MlpPolicy (Multi-Layer Perceptron) policy class for PPO. MlpPolicy uses an MLP for function approximation; CnnPolicy uses a CNN and is typically used for image-based inputs
from stable_baselines3.ppo import MlpPolicy

# Import make_vec_env utility function to create vectorized environments for parallel execution of multiple environment instances. This can speed up training by allowing multiple agents to interact with their environments simultaneously
from stable_baselines3.common.env_util import make_vec_env

### Instantiate Environment

Let's consider the __[Lunar Lander](https://www.gymlibrary.dev/environments/box2d/lunar_lander/)__ environment.

In [None]:
# Parallel Environments: make_vec_env is used to create multiple instances of the environment to be run in parallel. This allows the agent to collect more experience in less time, leading to faster training
vec_env = make_vec_env(
    "LunarLander-v2",          # The name of the environment to create (LunarLander-v2 in this case, from OpenAI's Gym)
    n_envs=4,                  # Number of parallel environments to create
    wrapper_class=gym.wrappers.TimeLimit,  # The wrapper class (TimeLimit wrapper) will be applied to each environment. This wrapper is used to set a maximum number of steps per episode.  A wrapper allows us to modify the behavior of an environment without altering its core implementation
    wrapper_kwargs={"max_episode_steps": 500}  # Keyword (additional) arguments for the wrapper (limit episodes to 500 steps). This prevents episodes from running indefinitely and ensures consistency in episode length
)

### Instantiate Agent

In [None]:
# Create the PPO model with the MLP policy. PPO is a RL algorithm known for its stability and efficiency
model = PPO("MlpPolicy", vec_env, verbose=1) # Initialize the PPO algorithm with the MLP policy (policy network will be MLP, i.e. the NN will consist of fully connected layers), using the vectorized environment created above, and set verbosity to 1 to print basic information

Using cuda device


### Training the Agent

Please check the __[documentation](https://stable-baselines3.readthedocs.io/en/v2.0.0/modules/ppo.html)__ to learn what information is printed in the output.

In [None]:
# Train the model
model.learn(total_timesteps=100000) # Train the PPO model on the environment for 100,000 timesteps



---------------------------------
| rollout/           |          |
|    ep_len_mean     | 92.5     |
|    ep_rew_mean     | -169     |
| time/              |          |
|    fps             | 1377     |
|    iterations      | 1        |
|    time_elapsed    | 5        |
|    total_timesteps | 8192     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 90.8        |
|    ep_rew_mean          | -138        |
| time/                   |             |
|    fps                  | 958         |
|    iterations           | 2           |
|    time_elapsed         | 17          |
|    total_timesteps      | 16384       |
| train/                  |             |
|    approx_kl            | 0.006861454 |
|    clip_fraction        | 0.0373      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.38       |
|    explained_variance   | 0.000952    |
|    learning_rate        | 0.

<stable_baselines3.ppo.ppo.PPO at 0x7a79208358d0>

### Explanation of the output

#### Rollout Phase

The "rollout phase" refers to the process of collecting experience data by running the agent (or policy) in the environment. During this phase, the agent interacts with the environment to generate episodes, which consist of sequences of states, actions, rewards, and next states. These episodes are used to gather information about how the agent performs given its current policy.

Here's a more detailed breakdown of what happens during the rollout phase:

**Initialization:**

- The environment is reset to an initial state.
- The agent acts with the policy defined by its current parameters, which stay fixed for the whole rollout.

**Interaction with the Environment:**

- The agent observes the current state of the environment.
- Based on the current state, the agent selects an action using its policy.
- The selected action is executed in the environment.
- The environment transitions to a new state and provides a reward based on the action taken.

**Data Collection:**

- The state, action, reward, and next state are recorded.
- This process is repeated until an episode ends (e.g., when the agent reaches a terminal state or a maximum number of steps is reached).

**End of Episode:**

- The cumulative reward and other statistics for the episode are recorded.
- The environment is reset, and a new episode begins.
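- This process continues until a predefined number of timesteps has been collected (in SB3's PPO, `n_steps` per environment), so a rollout can contain several episodes and may end in the middle of one.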

**Policy Update:**

- Once sufficient data is collected, it is used to update the policy.
- Depending on the algorithm, this may involve computing gradients and updating the policy parameters to maximize the expected reward.

The rollout phase is crucial for gathering the data needed to improve the agent's policy. It provides real experience from the environment, which the agent uses to learn and adapt. The quality and diversity of the data collected during the rollout phase significantly impact the effectiveness of the learning process.
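The rollout size can be checked against the training log with a little arithmetic. The sketch below assumes SB3's default `n_steps=2048` for PPO (the notebook does not override it) together with the 4 parallel environments and the 100,000 timesteps used above; the variable names are chosen for illustration only.

```python
import math

n_steps = 2048             # assumed: SB3's default number of steps per environment per rollout
n_envs = 4                 # number of parallel environments passed to make_vec_env
total_timesteps = 100_000  # value passed to model.learn

rollout_size = n_steps * n_envs                           # timesteps collected per iteration
n_iterations = math.ceil(total_timesteps / rollout_size)  # rollouts needed to reach the budget
timesteps_collected = n_iterations * rollout_size         # training stops on a rollout boundary

print(rollout_size)         # 8192
print(n_iterations)         # 13
print(timesteps_collected)  # 106496
```

This matches `total_timesteps` going from 8192 to 16384 between the first two iterations in the log. Training ends at the first rollout boundary at or beyond the requested budget, so slightly more than 100,000 timesteps are collected.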

#### Clipping

In PPO algorithms, "clipping" refers to a technique used to limit the change in the policy update to prevent it from diverging too far from the current policy. This helps ensure more stable and reliable learning.

Here's a detailed explanation of the clipping mechanism in PPO:


**Background**

In PPO, we aim to maximize the expected reward by adjusting the policy. This involves updating the policy parameters to increase the probability of actions that lead to higher rewards. However, if the updates are too large, the policy can change drastically, leading to instability and poor performance.


**Clipping in PPO**

To address this issue, PPO introduces a clipping mechanism that constrains the policy updates. This is done using a surrogate objective function, which includes a clipped version of the probability ratio between the new policy and the old policy. The probability ratio is defined as:
$r(\theta) = \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}$
where:
- $\pi_\theta(a \mid s)$ is the probability of taking action $a$ in state $s$ under the new policy with parameters $\theta$.
- $\pi_{\theta_{\text{old}}}(a \mid s)$ is the probability of taking action $a$ in state $s$ under the old policy with parameters $\theta_{\text{old}}$.


**Clipped Surrogate Objective**

The clipped surrogate objective is defined as:
$L^{\text{CLIP}}(\theta) = \mathbb{E} \left[ \min \left( r(\theta) \widehat{A}, \text{clip}(r(\theta), 1 - \epsilon, 1 + \epsilon) \widehat{A} \right) \right]$
where:
- $\widehat{A}$ is the advantage estimate.
- $\epsilon$ is a small hyperparameter (e.g., 0.2) that defines the clipping range.
- $\text{clip}(r(\theta), 1 - \epsilon, 1 + \epsilon)$ limits $r(\theta)$ to the range $[1 - \epsilon, 1 + \epsilon]$.


**Explanation**

- **Without Clipping:** The objective function would directly use $r(\theta) \widehat{A}$. Large updates to the policy could lead to a high variance in the training process, causing instability.
- **With Clipping:** The objective function uses the minimum of the unclipped and clipped terms. The clipping ensures that the ratio $r(\theta)$ does not deviate too far from $1$, thus preventing large updates to the policy.

By using this clipped objective, PPO limits the extent to which the policy can change in a single update, so each new policy stays close to the one that collected the data.
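The effect of the `min` and `clip` operations is easiest to see on single samples. The following is a minimal per-sample sketch of the clipped surrogate objective in plain Python; `clipped_surrogate` is a name chosen for this example, and the ratios and advantages are made-up values rather than numbers from the training run.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the gain from pushing the ratio above 1 + eps is cut off
print(round(clipped_surrogate(1.5, advantage=2.0), 2))   # 1.2 * 2.0 = 2.4
# Negative advantage: the gain from pushing the ratio below 1 - eps is cut off
print(round(clipped_surrogate(0.5, advantage=-2.0), 2))  # 0.8 * -2.0 = -1.6
# Inside the clipping range the objective is the plain r * A
print(round(clipped_surrogate(1.1, advantage=2.0), 2))   # 2.2
# The min keeps the full penalty when the ratio moved in the wrong direction
print(round(clipped_surrogate(1.5, advantage=-2.0), 2))  # -3.0
```

In the first two cases the objective no longer depends on the ratio, so its gradient is zero and the sample stops pushing the policy further away from the old one.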

**Policy Clip Statistics**

In the training output, `clip_fraction` is the fraction of samples whose probability ratio $r(\theta)$ fell outside the clipping range, i.e. the samples for which clipping was active. This statistic helps monitor how often the clipping mechanism is being triggered, which can provide insights into the stability and effectiveness of the learning process. `clip_range` is the $\epsilon$ of the clipped objective and defines the acceptable bounds for the probability ratio: if `clip_range` is $0.2$, then $r(\theta)$ is clipped to the range $[0.8, 1.2]$. It is designed to prevent excessively large updates to the policy, promoting stability.

For example:
- A high `clip_fraction` might indicate that the policy updates are frequently being constrained, which could suggest that the learning rate is too high or that the policy is changing too rapidly.
- A low `clip_fraction` suggests that the policy updates are mostly within the acceptable range, indicating more stable learning.
- A larger `clip_range` allows for more significant changes in the policy update ratio, which can lead to more aggressive updates. This can be useful in some scenarios but can also lead to instability and divergence in training if the updates are too aggressive.
- A smaller `clip_range` provides more conservative updates, promoting stability but potentially slowing down learning if the updates are too restrictive.

By keeping the policy changes within a controlled range, PPO achieves a good trade-off between improving the policy and maintaining stability during training.
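To make the `clip_fraction` statistic concrete, here is a small sketch that counts how many probability ratios in a batch fall outside $[1 - \epsilon, 1 + \epsilon]$. The function name and the ratios are illustrative assumptions; SB3 computes the statistic on tensors per minibatch and averages it, but the counting rule is the same idea.

```python
def clip_fraction(ratios, clip_range=0.2):
    """Fraction of samples whose ratio lies outside [1 - clip_range, 1 + clip_range]."""
    clipped = [abs(r - 1.0) > clip_range for r in ratios]
    return sum(clipped) / len(clipped)


# Made-up probability ratios for one minibatch (illustrative values only)
ratios = [0.95, 1.02, 1.31, 0.75, 1.10, 0.99, 1.05, 1.00]
print(clip_fraction(ratios))  # 2 of 8 samples are outside [0.8, 1.2] -> 0.25
```

A value such as the 0.0373 in the log above therefore means that fewer than 4% of the samples hit the clipping bounds during that update.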

#### Explained Variance

Explained Variance is a statistical measure used to quantify how well a model's predictions explain the variability of the observed data. In PPO, explained variance is used to evaluate the performance of the value function (also known as the critic).

Explained variance is defined as follows:

$$\text{Explained Variance} = 1 - \frac{\text{Var}(V(s) - R)}{\text{Var}(R)}$$

where:

- $V(s)$ is the value function's prediction of the expected return for state $s$, i.e. it is the agent's estimate of the expected return starting from state $s$ and following the current policy $\pi$ thereafter.
- $R$ is the actual return (sum of rewards) received, i.e. it is the total accumulated reward that the agent receives starting from a particular state until the end of the episode (lunar module either lands successfully, crashes, or the episode reaches its maximum length).
- $\text{Var}(V(s) - R)$ is the variance of the difference between the predicted value and the actual return.
- $\text{Var}(R)$ is the variance of the actual returns.

**Interpretation**

- **Values close to 1:** A high explained variance (close to 1) indicates that the value function predictions are closely aligned with the actual returns. This means the value function is accurately predicting the returns.
- **Values close to 0:** An explained variance close to 0 indicates that the value function does not explain the variability in the returns any better than simply using the mean of the returns. This suggests that the value function is not performing well.
- **Negative Values:** Negative values indicate that the predictions are worse than using the mean of the returns, which can happen if the model is overfitting or if there is a mismatch in the scale of predictions and actual returns.

In summary, explained variance is a key metric for assessing the accuracy of the value function in reinforcement learning. It helps in understanding how well the value function is predicting the actual returns, which in turn affects the quality of policy updates and overall performance of the reinforcement learning algorithm.
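The three cases in the interpretation above can be reproduced with a few lines of plain Python. This is a sketch of the formula, not the library's implementation: `explained_variance` is written here with the standard-library `statistics` module, the returns are made-up numbers, and returning NaN for zero-variance returns is a convention assumed for this example.

```python
from statistics import pvariance


def explained_variance(predicted, returns):
    """1 - Var(V(s) - R) / Var(R); NaN when the returns have zero variance."""
    var_returns = pvariance(returns)
    if var_returns == 0:
        return float("nan")
    residuals = [v - r for v, r in zip(predicted, returns)]
    return 1.0 - pvariance(residuals) / var_returns


returns = [10.0, -5.0, 20.0, 0.0]  # made-up returns R
perfect = list(returns)            # V(s) equals R for every state
constant = [6.25] * 4              # always predicts the mean return
inverted = [-r for r in returns]   # systematically wrong predictions

print(explained_variance(perfect, returns))   # 1.0
print(explained_variance(constant, returns))  # 0.0
print(explained_variance(inverted, returns))  # -3.0
```

A constant prediction scores exactly 0 because subtracting a constant does not change the variance of the returns, which is why values near 0 are typical at the start of training.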


#### Example of the output with comments

In [None]:
# ---------------------------------
# | rollout/                |      |
# |    ep_len_mean          | 91.6 |  # Average episode length. The mean length of episodes (in timesteps) during the rollout phase
# |    ep_rew_mean          | -165 |  # Average episode reward. The mean reward per episode
# | time/                   |      |
# |    fps                  | 2071 |  # Frames (timesteps) per second: the number of timesteps processed divided by the time elapsed since training started
# |    iterations           | 1    |  # Number of iterations the training loop has completed
# |    time_elapsed         | 4    |  # Time elapsed since the training started, in seconds
# |    total_timesteps      | 10240|  # Total number of timesteps that have been processed
# ---------------------------------
# ---------------------------------
# | rollout/                |      |
# |    ep_len_mean          | 96.2 |  # Mean length of episodes has increased from 91.6 to 96.2 timesteps
# |    ep_rew_mean          | -154 |  # Mean reward per episode has slightly improved from -165 to -154
# | time/                   |      |
# |    fps                  | 1183 |  # Frames per second have decreased to 1183, because the elapsed time now also includes the first policy update and not only environment simulation
# |    iterations           | 2    |  # Number of iterations has increased to 2
# |    time_elapsed         | 17   |  # Time elapsed has increased to 17 seconds
# |    total_timesteps      | 20480|  # Total number of timesteps has increased to 20480
# | train/                  |      |
# |    approx_kl            | 0.00750589|  # Approximate KL divergence between the old and new policy distributions. It measures how much the new policy deviates from the old policy. Smaller values indicate that the policies are similar, which is desired during training to ensure stable learning
# |    clip_fraction        | 0.0474    |  # Fraction of time the policy update was clipped. Clipping helps prevent large, destabilizing updates
# |    clip_range           | 0.2       |  # Clipping range for the policy updates. It ensures that the policy changes are small and stable, preventing large, potentially harmful updates
# |    entropy_loss         | -1.38     |  # Entropy loss: the negative of the policy's mean entropy. A more negative value means a more random policy (more exploration), while values closer to 0 mean the policy is more deterministic
# |    explained_variance   | -0.00924  |  # Explained variance of the value function. It measures how well the value function explains the variance of the returns. Here, it's -0.00924 (close to 0), so the value function is not yet better than predicting the mean return
# |    learning_rate        | 0.0003    |  # Learning rate used during training. It determines the step size for updating the policy and value function. Smaller learning rates lead to more gradual updates, while larger learning rates lead to faster but potentially unstable updates
# |    loss                 | 319       |  # Total loss value: a combination of the policy loss, value loss, and entropy loss. It is useful for spotting divergence, but a lower loss does not necessarily mean higher episode rewards
# |    n_updates            | 10        |  # Number of update epochs performed so far. PPO makes several passes (n_epochs, 10 by default) over each rollout, so this counter grows by n_epochs per iteration
# |    policy_gradient_loss | -0.00618  |  # Policy gradient loss: the negative of the clipped surrogate objective, averaged over the update. Its value alone does not show whether the policy is improving; use ep_rew_mean for that
# |    value_loss           | 934       |  # Value function loss: the mean squared error between the predicted values and the estimated returns. Lower values indicate a more accurate value function
# --------------------------------------
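The `entropy_loss` of about -1.38 early in training has a simple interpretation. LunarLander-v2 has four discrete actions, and a freshly initialized policy is close to uniform over them. The sketch below computes the entropy of such a uniform policy, assuming the logged value is the negative of the mean entropy (in nats); the variable names are illustrative.

```python
import math

n_actions = 4  # LunarLander-v2: do nothing, fire left, fire main, fire right
uniform_policy = [1.0 / n_actions] * n_actions

# Shannon entropy in nats: H = -sum(p * ln p)
entropy = -sum(p * math.log(p) for p in uniform_policy)
entropy_loss = -entropy  # assumed sign convention: logged value is minus the entropy

print(round(entropy, 2))       # 1.39
print(round(entropy_loss, 2))  # -1.39
```

The maximum possible entropy is $\ln 4 \approx 1.386$, so a logged value near -1.38 indicates an almost uniformly random policy. As training progresses and the policy becomes more decisive, the value moves toward 0.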

### Evaluation

In [None]:
# Import the evaluate_policy function from stable_baselines3 for evaluating the agent's performance
from stable_baselines3.common.evaluation import evaluate_policy

# Use a separate environment for evaluation (for more details see above)
eval_env = make_vec_env("LunarLander-v2", n_envs=1, wrapper_class=gym.wrappers.TimeLimit, wrapper_kwargs={"max_episode_steps":500})

mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=5) # Evaluate the agent's policy over 5 episodes in the eval_env. The function returns the mean reward and the standard deviation of rewards

# Print a summary of the agent's performance during the evaluation episodes
print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")

mean_reward:-116.88 +/- 19.38
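The two numbers in this summary are the mean and the standard deviation of the total reward per evaluation episode. A minimal sketch of that aggregation is shown below, using made-up episode rewards (they are not the rewards of the run above) and assuming the population standard deviation, as `np.std` computes by default.

```python
from statistics import mean, pstdev

# Made-up total rewards for 5 evaluation episodes (illustrative, not from the run above)
episode_rewards = [-120.0, -95.0, -140.0, -110.0, -135.0]

mean_reward = mean(episode_rewards)
std_reward = pstdev(episode_rewards)  # population standard deviation, like np.std

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")
```

With only 5 episodes the estimate is noisy, so increasing `n_eval_episodes` gives a more reliable picture of the agent's performance.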


### Saving and Loading

In [None]:
# Save the current trained model to a file named ppo_lunarlander.zip. This is useful for preserving the model so it can be loaded and used later without having to retrain it
model.save("ppo_lunarlander")

# Remove the model from memory. It is included here to demonstrate that the model can be completely removed from memory and then reloaded from the saved file
del model

# Load the model from the previously saved file ppo_lunarlander.zip. After this, the model object will be restored and can be used as before
model = PPO.load("ppo_lunarlander")

### Visualization

In [None]:
# For visualization
from gym.wrappers.monitoring import video_recorder  # Import video recording utility from gym
from IPython.display import HTML  # Import HTML display utility from IPython
from IPython import display  # Import display utility from IPython
import glob  # Import glob for file pattern matching
import base64, io, os, shutil  # Import base64 for encoding, io for file handling, os for operating system interactions, and shutil for file operations
from stable_baselines3.common.vec_env import VecVideoRecorder, DummyVecEnv  # Import vectorized video recorder and dummy vectorized environment from stable-baselines3

# Set SDL video driver to 'dummy' to avoid issues on headless servers (servers without a graphical interface)
os.environ['SDL_VIDEODRIVER'] = 'dummy'

In [None]:
# Remove existing 'video' directory and create a new one to store video files
shutil.rmtree('video', ignore_errors=True)
os.makedirs("video", exist_ok=True)

# Function to display the recorded video. It searches for mp4 files in the 'video' directory, reads and encodes the video in base64 format, and displays it using an HTML video tag
def show_video():
    mp4list = glob.glob('video/*.mp4')  # Get list of mp4 files in the 'video' directory
    if len(mp4list) > 0:
        mp4 = mp4list[0]  # Get the first mp4 file from the list
        video = io.open(mp4, 'r+b').read()  # Read the video file
        encoded = base64.b64encode(video)  # Encode the video in base64
        display.display(HTML(data='''<video alt="test" autoplay loop controls style="height: 400px;">
              <source src="data:video/mp4;base64,{0}" type="video/mp4" />
            </video>'''.format(encoded.decode('ascii'))))  # Display the video in an HTML video tag
    else:
        print("Could not find video")  # Print error message if no video found


# Function to record a video of an RL model's performance in the LunarLander-v2 environment. It sets up the environment and video recorder, runs the model for a fixed number of steps, and records the video
def show_video_of_model():
    """
    Record a video of the global `model` acting in the LunarLander-v2 environment.

    The function takes no arguments: the environment ID, the video length (600 frames),
    the file-name prefix and the video folder ("video/") are set in the function body.
    """
    video_length = 600  # Set video length to 600 frames
    eval_env = make_vec_env("LunarLander-v2", n_envs=1)  # Create a vectorized environment for LunarLander-v2

    # Start the video at step=0 and record 600 steps
    eval_env = VecVideoRecorder(
        eval_env,
        video_folder="video/",  # Folder to save the video
        record_video_trigger=lambda step: step == 0,  # Trigger video recording at the first step
        video_length=video_length,  # Length of the video
        name_prefix="",  # Prefix for the video file name
    )

    obs = eval_env.reset()  # Reset the environment to start
    for _ in range(video_length):
        action, _ = model.predict(obs)  # Predict the action using the model
        obs, _, _, _ = eval_env.step(action)  # Take the action in the environment and get the new observation

    # Close the video recorder
    eval_env.close()

In [None]:
show_video_of_model()

Saving video to /content/video/-step-0-to-step-600.mp4
Moviepy - Building video /content/video/-step-0-to-step-600.mp4.
Moviepy - Writing video /content/video/-step-0-to-step-600.mp4





Moviepy - Done !
Moviepy - video ready /content/video/-step-0-to-step-600.mp4


In [None]:
show_video()



# $\color{limegreen}{\text{TODO: Part 1}}$

### $\color{limegreen}{\text{Task 1 (1 point)}}$


$\text{As you can see, the mean reward above is negative. Can you modify one of the hyperparameters so that the mean reward becomes positive?}$

### $\color{limegreen}{\text{Task 2 (3 points)}}$

$\text{Plot the training progress (reward over time) using matplotlib.}$
<details>
  <summary>Hint</summary>
  PPO's method $\textbf{learn}$ has a parameter $\textbf{callback}$, which is called at every step and has access to the state of the algorithm. We can build our custom callback, for example, in the following way:
  <details>
    <summary>Show Code</summary>
  
    from stable_baselines3.common.callbacks import BaseCallback

    class RewardCallback(BaseCallback):
        def __init__(self, verbose=0):
            super(RewardCallback, self).__init__(verbose)
            self.rewards = []

        def _on_step(self) -> bool:
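            if self.locals['dones'][0]:  # Check if an episode is done in the first environment
                episode_reward = self.locals['infos'][0]['episode']['r']  # Total episode reward recorded by the Monitor wrapper (already a single number)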
                self.rewards.append(episode_reward)
                print(f"Episode: {len(self.rewards)}, Reward: {episode_reward}")
            return True  
  </details>
</details>

In [None]:
import matplotlib.pyplot as plt

# Plotting
plt.plot()
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.title('Training Progress')
plt.show()

### $\color{limegreen}{\text{Task 3 (3 points)}}$

$\text{Above we have trained our PPO model on the "LunarLander-v2" environment. Now, please complete the following tasks}$
1. $\text{Train your PPO model on the}$ __["MountainCar-v0"](https://www.gymlibrary.dev/environments/classic_control/mountain_car/)__ $\text{environment.}$
2. $\text{Does your car reach the flag?}$
3. $\text{In case your car does not reach the flag, adjust the hyperparameters until it does.}$
<details>
  <summary>Hint</summary>
  Set the number of parallel environments to 16, the number of steps to run for each environment per update (n_steps) to 256, and increase the number of timesteps to 1 million.
</details>

### $\color{limegreen}{\text{Task 4 (2 points)}}$

1. $\text{Replace the PPO model with the}$ __[DQN](https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html)__ $\text{model and train the agent on the same environment. Compare the results.}$
2. $\text{While using the DQN model, replace "MlpPolicy" with "CnnPolicy" and train the agent on the same environment. Do you observe anything unusual?}$

### $\color{limegreen}{\text{Task 5 (2 points)}}$

1. $\text{Install the gym library with the Atari environment, used for training agents on}$ __[Atari 2600 games](https://en.wikipedia.org/wiki/List_of_Atari_2600_games)__. $\text{To install the Atari environments, run the command}$ `!pip install -q gym[atari]`.
2. $\text{Training an RL agent on Atari games is straightforward thanks to the}$ `make_atari_env` $\text{helper function. Consider, for example, the environment}$ __["ALE/MsPacman-v5"](https://www.gymlibrary.dev/environments/atari/ms_pacman/)__ $\text{(or you can choose any other you want)}$. $\text{Train your agent using the PPO model with "CnnPolicy" and provide the visualization.}$  

# Part 2: Gym and VecEnv Wrappers

## Anatomy of a gym wrapper

A gym wrapper follows the [gym](https://stable-baselines.readthedocs.io/en/master/guide/custom_env.html) interface: it has `reset()` and `step()` methods. A wrapper allows us to modify the behavior of an environment without altering its core implementation.

Because a wrapper is *around* an environment, we can access the wrapped environment with `self.env`, which makes it easy to interact with it without modifying the original env.
Many wrappers are predefined; for a complete list, refer to the [gym documentation](https://gymnasium.farama.org/api/wrappers/).

In [3]:
# Define a custom wrapper class that inherits from gym.Wrapper
class CustomWrapper(gym.Wrapper):
    """
    :param env: (gym.Env) Gym environment that will be wrapped
    """
    def __init__(self, env):
        """
        Initialize the wrapper.
        :param env: The environment to wrap.
        """
        # Call the parent constructor, so we can access self.env later
        super().__init__(env)

    def reset(self, **kwargs):
        """
        Reset the environment.
        :param kwargs: Additional arguments for the reset method.
        :return: Initial observation and info after reset.
        """
        # Call the reset method of the wrapped environment
        obs, info = self.env.reset(**kwargs)
        return obs, info

    def step(self, action):
        """
        Step the environment with the given action.
        :param action: (float or int) Action taken by the agent.
        :return: Tuple containing observation, reward, terminated, truncated, and info.
                 - obs: The next observation.
                 - reward: The reward received from the environment.
                 - terminated: Whether the episode has terminated.
                 - truncated: Whether the episode was truncated (e.g., max steps reached).
                 - info: Additional information.
        """
        # Take a step in the wrapped environment
        obs, reward, terminated, truncated, info = self.env.step(action)
        return obs, reward, terminated, truncated, info



## First example: limit the episode length

One practical use case of a wrapper is limiting the number of steps per episode. For that, you need to overwrite the `truncated` signal when the limit is reached. It can also be useful to pass that information in the `info` dictionary, although the wrapper below relies on the `truncated` flag alone.

In [4]:
# Define a custom wrapper class that inherits from gym.Wrapper
class TimeLimitWrapper(gym.Wrapper):
    """
    :param env: (gym.Env) Gym environment that will be wrapped
    :param max_steps: (int) Max number of steps per episode
    """
    def __init__(self, env, max_steps=100):
        """
        Initialize the wrapper.
        :param env: The environment to wrap.
        :param max_steps: The maximum number of steps per episode.
        """
        # Call the parent constructor, so we can access self.env later
        super(TimeLimitWrapper, self).__init__(env)
        self.max_steps = max_steps
        # Counter of steps per episode
        self.current_step = 0

    def reset(self, **kwargs):
        """
        Reset the environment.
        :param kwargs: Additional arguments for the reset method.
        :return: Initial observation and info after reset.
        """
        # Reset the counter
        self.current_step = 0
        # Call the reset method of the wrapped environment
        return self.env.reset(**kwargs)

    def step(self, action):
        """
        Step the environment with the given action.
        :param action: (float or int) Action taken by the agent.
        :return: Tuple containing observation, reward, terminated, truncated, and info.
                 - obs: The next observation.
                 - reward: The reward received from the environment.
                 - terminated: Whether the episode has terminated.
                 - truncated: Whether the episode was truncated (e.g., max steps reached).
                 - info: Additional information.
        """
        self.current_step += 1
        # Take a step in the wrapped environment
        obs, reward, terminated, truncated, info = self.env.step(action)

        # Overwrite the truncation signal when the number of steps reaches the maximum
        if self.current_step >= self.max_steps:
            truncated = True

        return obs, reward, terminated, truncated, info

#### Test the wrapper

Let's test our wrapper on the __[Pendulum](https://www.gymlibrary.dev/environments/classic_control/pendulum/)__ environment.

In [6]:
# Import the Pendulum environment from gymnasium
from gymnasium.envs.classic_control.pendulum import PendulumEnv

# Here we create the environment directly because gym.make() already wraps environments with a default TimeLimit wrapper.
# If we were to use gym.make(), the environment would already have this TimeLimit wrapper applied, which might conflict with or make redundant the custom TimeLimitWrapper we are trying to test.
# By creating the environment directly, we avoid the automatic application of the default TimeLimit wrapper, allowing you to explicitly wrap the environment with your custom TimeLimitWrapper.
env = PendulumEnv()

# Wrap the environment with our custom TimeLimitWrapper
env = TimeLimitWrapper(env, max_steps=100)

# Reset the environment and get the initial observation
obs, _ = env.reset()
done = False
n_steps = 0

# Loop until the episode is done
while not done:
    # Take random actions
    random_action = env.action_space.sample()
    # Step the environment with the random action
    obs, reward, terminated, truncated, info = env.step(random_action)
    # Update the done flag if the episode is terminated or truncated
    done = terminated or truncated
    # Increment the step counter
    n_steps += 1

# Print the number of steps taken and the info dictionary
print(n_steps, info)

100 {}


In practice (as we saw above), `gym` already has a wrapper for this, named `TimeLimit` (`gym.wrappers.TimeLimit`), which is applied to most environments.

## Second example: normalize actions

It is usually a good idea to normalize observations and actions before giving them to the agent; this prevents problems such as this [hard-to-debug issue](https://github.com/hill-a/stable-baselines/issues/473).

In this example, we are going to normalize the action space of __["Pendulum-v1"](https://www.gymlibrary.dev/environments/classic_control/pendulum/)__ so all actions lie in [-1, 1] instead of [-2, 2].

Note: here we are dealing with continuous actions, hence the `gym.spaces.Box` space.

In [7]:
# Import numpy for numerical operations
import numpy as np

# Define a custom wrapper class that inherits from gym.Wrapper
class NormalizeActionWrapper(gym.Wrapper):
    """
    :param env: (gym.Env) Gym environment that will be wrapped
    """
    def __init__(self, env):
        """
        Initialize the wrapper.
        :param env: The environment to wrap.
        """
        # Retrieve the action space
        action_space = env.action_space
        assert isinstance(action_space, gym.spaces.Box), "This wrapper only works with continuous action space (spaces.Box)"

        # Retrieve the max/min values
        self.low, self.high = action_space.low, action_space.high

        # Call the parent constructor, so we can access self.env later
        super().__init__(env)

        # Override the action space exposed by the wrapper, so all actions lie in [-1, 1]
        # (the wrapped environment itself is left unmodified)
        self.action_space = gym.spaces.Box(low=-1, high=1, shape=action_space.shape, dtype=np.float32)

    def rescale_action(self, scaled_action):
        """
        Rescale the action from [-1, 1] to [low, high]
        (no need for symmetric action space)
        :param scaled_action: (np.ndarray) The action to rescale
        :return: (np.ndarray) The rescaled action
        """
        return self.low + 0.5 * (scaled_action + 1.0) * (self.high - self.low)

    def reset(self, **kwargs):
        """
        Reset the environment.
        :param kwargs: Additional arguments for the reset method.
        :return: Initial observation and info after reset.
        """
        return self.env.reset(**kwargs)

    def step(self, action):
        """
        Step the environment with the given action.
        :param action: (np.ndarray) Normalized action in [-1, 1] taken by the agent.
        :return: Tuple containing observation, reward, terminated, truncated, and info.
                 - obs: The next observation.
                 - reward: The reward received from the environment.
                 - terminated: Whether the episode has terminated.
                 - truncated: Whether the episode was truncated (e.g., max steps reached).
                 - info: Additional information.
        """
        # Rescale the action from [-1, 1] to the original [low, high] interval
        rescaled_action = self.rescale_action(action)
        # Take a step in the wrapped environment
        obs, reward, terminated, truncated, info = self.env.step(rescaled_action)
        return obs, reward, terminated, truncated, info



#### Test before rescaling actions

In [8]:
# Create an instance of the Pendulum-v1 environment using gym.make()
original_env = gym.make("Pendulum-v1")

# Print the lower bound of the action space
print("Action space lower bound:", original_env.action_space.low)

# Sample and print 10 actions from the action space to see their values
for _ in range(10):
    # Sample a random action from the action space
    action = original_env.action_space.sample()
    # Print the sampled action
    print("Sampled action:", action)

Action space lower bound: [-2.]
Sampled action: [-0.12697212]
Sampled action: [1.3001432]
Sampled action: [-1.3474667]
Sampled action: [-0.77796227]
Sampled action: [1.4073026]
Sampled action: [-0.42040053]
Sampled action: [1.7825118]
Sampled action: [1.4175887]
Sampled action: [-1.9813699]
Sampled action: [0.5613942]


#### Test the NormalizeAction wrapper

In [9]:
# Wrap the Pendulum-v1 environment with the NormalizeActionWrapper
env = NormalizeActionWrapper(gym.make("Pendulum-v1"))

# Print the lower bound of the modified action space
print("Normalized action space lower bound:", env.action_space.low)

# Sample and print 10 actions from the modified action space to see their normalized values
for _ in range(10):
    # Sample a random action from the modified action space
    action = env.action_space.sample()
    # Print the sampled normalized action
    print("Sampled normalized action:", action)

Normalized action space lower bound: [-1.]
Sampled normalized action: [0.41116357]
Sampled normalized action: [0.11860874]
Sampled normalized action: [0.92284876]
Sampled normalized action: [-0.97425205]
Sampled normalized action: [-0.40990657]
Sampled normalized action: [0.33467343]
Sampled normalized action: [-0.3220679]
Sampled normalized action: [0.08024178]
Sampled normalized action: [0.11177742]
Sampled normalized action: [-0.19933397]
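
The sampled actions now lie in [-1, 1], and the wrapper maps them back to the original interval before stepping the environment. The sketch below reproduces that affine mapping outside the wrapper using only NumPy, with Pendulum's bounds of -2 and 2 written out by hand. `normalize_action` is a hypothetical helper added here for illustration (it is not part of the wrapper) that shows the inverse mapping.

```python
import numpy as np

def rescale_action(scaled_action, low, high):
    """Map an action from [-1, 1] to [low, high] (same formula as in the wrapper)."""
    return low + 0.5 * (scaled_action + 1.0) * (high - low)

def normalize_action(action, low, high):
    """Hypothetical inverse helper: map an action from [low, high] back to [-1, 1]."""
    return 2.0 * (action - low) / (high - low) - 1.0

# Pendulum torque bounds, written out by hand for this example
low = np.array([-2.0], dtype=np.float32)
high = np.array([2.0], dtype=np.float32)

for value in (-1.0, 0.0, 0.5, 1.0):
    scaled = np.array([value], dtype=np.float32)
    torque = rescale_action(scaled, low, high)
    # The round trip should give back the normalized value
    print(scaled, "->", torque, "->", normalize_action(torque, low, high))
```

Because the formula uses `low` and `high` separately, it also works for asymmetric bounds: with `low=0` and `high=10`, a normalized action of 0 maps to 5.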


#### Test with an RL algorithm

We are going to use the `Monitor` wrapper from Stable Baselines3, which records training statistics (mean episode reward, mean episode length).

In [13]:
# Import the necessary modules from Stable Baselines3
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv

# Wrap the Pendulum-v1 environment with the Monitor wrapper to record training statistics (like mean episode reward, mean episode length, etc.)
env = Monitor(gym.make("Pendulum-v1"))

# Wrap the monitored environment in a DummyVecEnv. Stable Baselines3 algorithms operate on vectorized environments
# (a plain environment passed to an algorithm is wrapped automatically); DummyVecEnv exposes a single instance through that interface
env = DummyVecEnv([lambda: env])

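# Create and train the PPO model with the MlpPolicy.
# PPO collects n_steps=2048 transitions per rollout by default, so the log reports 2048 timesteps even though 1000 were requested
model = PPO("MlpPolicy", env, verbose=1).learn(total_timesteps=1000)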

Using cuda device
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 200       |
|    ep_rew_mean     | -1.22e+03 |
| time/              |           |
|    fps             | 620       |
|    iterations      | 1         |
|    time_elapsed    | 3         |
|    total_timesteps | 2048      |
----------------------------------


With the action wrapper

In [14]:
# Wrap the Pendulum-v1 environment with the Monitor wrapper to record training statistics like before
normalized_env = Monitor(gym.make("Pendulum-v1"))

# Apply the NormalizeActionWrapper to the monitored environment
normalized_env = NormalizeActionWrapper(normalized_env)

# Wrap the normalized environment with the DummyVecEnv to handle vectorized environments
normalized_env = DummyVecEnv([lambda: normalized_env])

# Create and train the PPO model with the MlpPolicy on the normalized environment
model_N = PPO("MlpPolicy", normalized_env, verbose=1).learn(total_timesteps=1000)

Using cuda device
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 200       |
|    ep_rew_mean     | -1.28e+03 |
| time/              |           |
|    fps             | 771       |
|    iterations      | 1         |
|    time_elapsed    | 2         |
|    total_timesteps | 2048      |
----------------------------------


# $\color{limegreen}{\text{TODO: Part 2}}$

### $\color{limegreen}{\text{Task 6 (2 points)}}$

1. $\text{Modify the}$ `CustomWrapper` $\text{to print the total number of steps taken after each reset.}$
2. $\text{Implement a new method}$ `get_episode_length` $\text{in the}$ `CustomWrapper` $\text{to return the number of steps in the current episode.}$

### $\color{limegreen}{\text{Task 7 (4 points)}}$

1. $\text{Train the PPO algorithm on the "Pendulum-v1" environment with and without the}$ `NormalizeActionWrapper`.
2. $\text{Plot the training rewards for both cases and analyze the impact of action normalization on the training performance.}$
<details>
  <summary>Hint</summary>
  See Task 2.
</details>

### $\color{limegreen}{\text{Task 8 (3 points)}}$

$\text{In the "MountainCar-v0" environment, we can create a custom reward function that encourages the agent to reach the flag faster. A suggestion is the following:}$

In [None]:
class CustomMountainCarReward(gym.Wrapper):
    def __init__(self, env):
        super(CustomMountainCarReward, self).__init__(env)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
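        position = obs[0]
        # Read goal_position from the unwrapped environment: gym.make() adds wrappers that do not define this attribute
        if position >= self.env.unwrapped.goal_position: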
            reward = 100  # Reward for reaching the flag
        else:
            reward = -1  # Negative reward for each step taken
        return obs, reward, terminated, truncated, info

1. $\text{Test the custom reward function and compare the performance with the original reward function.}$
<details>
  <summary>Hint</summary>
  See Task 2.
</details>
2. $\text{Plot and analyze the training rewards with the custom reward function.}$