## Gymnasium - An API standard for reinforcement learning with a diverse collection of reference environments

Documentation: https://gymnasium.farama.org/introduction/train_agent/

Gymnasium is a project that provides an API (application programming interface) for all single-agent reinforcement learning environments, with implementations of common environments: cartpole, pendulum, mountain-car, mujoco, atari, and more.
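
The core of that API is a `reset`/`step` loop. The sketch below illustrates the shape of the loop only: `CountdownEnv` is a hypothetical toy class written for this note (it is not part of Gymnasium) that mimics the signatures, where `reset` returns `(observation, info)` and `step` returns `(observation, reward, terminated, truncated, info)`. A real environment would come from `gym.make(...)` instead.

```python
import random


class CountdownEnv:
    """Hypothetical toy environment that mimics the Gymnasium reset/step signatures."""

    def __init__(self, start=5, max_steps=20):
        self.start = start
        self.max_steps = max_steps

    def reset(self, seed=None):
        # Like Gymnasium, reset returns (observation, info)
        self.rng = random.Random(seed)
        self.state = self.start
        self.steps = 0
        return self.state, {}

    def step(self, action):
        # Action 1 decrements the counter, action 0 does nothing
        self.state -= action
        self.steps += 1
        reward = -1.0  # constant cost per step
        terminated = self.state <= 0  # the task itself is finished
        truncated = self.steps >= self.max_steps  # time limit reached
        return self.state, reward, terminated, truncated, {}


env = CountdownEnv()
obs, info = env.reset(seed=0)
total_reward = 0.0
done = False
while not done:
    # Random policy; with a real env this would be env.action_space.sample()
    action = env.rng.choice([0, 1])
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"episode finished: obs={obs}, return={total_reward}")
```

An episode ends when either `terminated` or `truncated` is true, which is why the loop checks both flags.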

# Stable Baselines3 - Training, Saving and Loading

Github Repo: [https://github.com/DLR-RM/stable-baselines3](https://github.com/DLR-RM/stable-baselines3)


[RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo) is a training framework for Reinforcement Learning (RL), using Stable Baselines3.

It provides scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.

Documentation is available online: [https://stable-baselines3.readthedocs.io/](https://stable-baselines3.readthedocs.io/)

Examples with Colab code: [https://stable-baselines3.readthedocs.io/en/master/guide/examples.html](https://stable-baselines3.readthedocs.io/en/master/guide/examples.html)



# 1. Open a Terminal and install the following packages:

sudo apt-get install build-essential python-dev-is-python3 swig python3-pygame git

# 2. Enable autoformatting and install box2d-py and stable-baselines3

In [None]:
# for autoformatting
# %load_ext jupyter_black


In [None]:
# Use pip to install box2d-py, stable-baselines3[extra] (version >= 2.0.0a4 is required),
# and gymnasium[other], which includes moviepy

!pip install box2d-py
!pip install "stable-baselines3[extra]>=2.0.0a4" "gymnasium[other]"


Collecting moviepy>=1.0.0 (from gymnasium[other])
  Downloading moviepy-2.2.1-py3-none-any.whl.metadata (6.9 kB)
Collecting seaborn>=0.13 (from gymnasium[other])
  Downloading seaborn-0.13.2-py3-none-any.whl.metadata (5.4 kB)
Collecting imageio<3.0,>=2.5 (from moviepy>=1.0.0->gymnasium[other])
  Downloading imageio-2.37.0-py3-none-any.whl.metadata (5.2 kB)
Collecting imageio_ffmpeg>=0.2.0 (from moviepy>=1.0.0->gymnasium[other])
  Downloading imageio_ffmpeg-0.6.0-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting proglog<=1.0.0 (from moviepy>=1.0.0->gymnasium[other])
  Downloading proglog-0.1.12-py3-none-any.whl.metadata (794 bytes)
Collecting python-dotenv>=0.10 (from moviepy>=1.0.0->gymnasium[other])
  Downloading python_dotenv-1.1.1-py3-none-any.whl.metadata (24 kB)
Downloading moviepy-2.2.1-py3-none-any.whl (129 kB)
Downloading imageio-2.37.0-py3-none-any.whl (315 kB)
Downloading proglog-0.1.12-py3-none-any.whl (6.3 kB)
Downloading imageio_ffmpeg-0.6.0-py3-none-manylinux

## Import policy, RL agent and create directories

In [1]:
import gymnasium as gym
import numpy as np

from stable_baselines3 import DQN
from stable_baselines3.common.callbacks import CheckpointCallback


In [None]:
# Create directories for models, videos and tb_logs

import os

model_dir = "models/dqn_lunar"
video_final_dir = "videos/final"
video_progress_dir = "videos/progress"
log_dir = "tb_logs"

os.makedirs(model_dir, exist_ok=True)
os.makedirs(video_final_dir, exist_ok=True)
os.makedirs(video_progress_dir, exist_ok=True)
os.makedirs(log_dir, exist_ok=True)

## Create the Gym env and instantiate the agent

For this example, we will use the Lunar Lander environment.

"Landing outside landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land on its first attempt. Four discrete actions available: do nothing, fire left orientation engine, fire main engine, fire right orientation engine. "

Lunar Lander environment: [https://gymnasium.farama.org/environments/box2d/lunar_lander/](https://gymnasium.farama.org/environments/box2d/lunar_lander/)

![Lunar Lander](https://cdn-images-1.medium.com/max/960/1*f4VZPKOI0PYNWiwt0la0Rg.gif)


We chose `MlpPolicy` because the input of Lunar Lander is a feature vector, not an image.

The type of action to use (discrete/continuous) is deduced automatically from the environment's action space.



In [None]:
# Separate env for evaluation, wrapped in RecordVideo
eval_env = gym.make("LunarLander-v3", render_mode="rgb_array")
eval_env = gym.wrappers.RecordVideo(eval_env, video_folder=video_progress_dir, episode_trigger=lambda ep: ep==0)

# Create an evaluation model
eval_model = DQN(
    "MlpPolicy",
    eval_env,
    verbose=1,
    exploration_final_eps=0.1,
    target_update_interval=250,
    tensorboard_log=log_dir,
)



Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


We load a helper function to evaluate the agent:

In [None]:
# import a helper function: evaluate_policy to evaluate the policy

from stable_baselines3.common.evaluation import evaluate_policy

Let's evaluate the untrained agent. With randomly initialized weights, it should behave much like a random agent.

In [None]:
# Evaluate the agent before training on the evaluation env (render_mode="rgb_array").
# Print the mean and standard deviation of the episode rewards;
# the video is saved in video_progress_dir.

mean_reward, std_reward = evaluate_policy(
    eval_model,
    eval_env,
    n_eval_episodes=1,
    deterministic=True,
    render=False,
)

print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")



mean_reward=-105.92 +/- 0.0
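
The `+/- 0.0` above follows from `n_eval_episodes=1`: a single episode has no spread. The sketch below assumes that `evaluate_policy` reports the mean and the population standard deviation of the per-episode returns. `summarize` is a hypothetical helper, and the five-episode returns are made-up numbers used only to show a non-zero spread.

```python
from statistics import fmean, pstdev


def summarize(episode_rewards):
    """Mean and population standard deviation of per-episode returns."""
    return fmean(episode_rewards), pstdev(episode_rewards)


# One episode, as in the cell above: the spread is necessarily zero.
single_mean, single_std = summarize([-105.92])

# Hypothetical returns from five episodes (illustrative numbers, not real results).
multi_mean, multi_std = summarize([-120.0, -95.0, -210.0, -60.0, -105.0])

print(f"1 episode : {single_mean:.2f} +/- {single_std:.2f}")
print(f"5 episodes: {multi_mean:.2f} +/- {multi_std:.2f}")
```

Raising `n_eval_episodes` (for example to 10) gives a more reliable estimate, at the cost of a longer evaluation.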


## Train the agent and save it

Warning: this may take a while

In [None]:
# Trigger video creation every 50 episodes
def episode_trigger(ep):
    return ep % 50 == 0

# Training env: wrap `env` itself (not eval_env) so that training episodes are recorded
env = gym.make("LunarLander-v3", render_mode="rgb_array")
env = gym.wrappers.RecordVideo(env, video_folder=video_progress_dir, episode_trigger=episode_trigger)

# Create a DQN model

model = DQN(
    "MlpPolicy",
    env,
    verbose=1,
    exploration_final_eps=0.1,
    target_update_interval=250,
    tensorboard_log=log_dir,
)

# Save every 10000 steps
checkpoint_callback = CheckpointCallback(
    save_path="models/dqn_lunar",
    save_freq=10000,
    name_prefix="dqn_lunar",
)
# Train for 100,000 timesteps with the checkpoint callback; log to TensorBoard under the run name "dqn_lunar"

model.learn(total_timesteps=int(1e5), callback=checkpoint_callback, tb_log_name="dqn_lunar")

# Save the final agent
model.save("dqn_lunar_final")

# Delete the trained model to demonstrate loading
del model

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Logging to tb_logs/dqn_lunar_2
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 105      |
|    ep_rew_mean      | -102     |
|    exploration_rate | 0.962    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 19       |
|    time_elapsed     | 21       |
|    total_timesteps  | 419      |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 1.09     |
|    n_updates        | 79       |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 97.1     |
|    ep_rew_mean      | -147     |
|    exploration_rate | 0.93     |
| time/               |          |
|    episodes         | 8        |
|    fps              | 31       |
|    time_elapsed     | 24       |
|    total_timesteps  | 777      |
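
The `exploration_rate` values in the log can be checked by hand. The sketch below assumes the DQN defaults `exploration_fraction=0.1` and `exploration_initial_eps=1.0` (neither is set explicitly in this notebook), together with `exploration_final_eps=0.1` and `total_timesteps=100000` from the cells above. `epsilon` is a hypothetical helper that approximates the linear schedule; it is not the library's own code.

```python
def epsilon(step, total_timesteps=100_000, exploration_fraction=0.1,
            initial_eps=1.0, final_eps=0.1):
    """Linearly anneal epsilon over the first exploration_fraction of training."""
    progress = min(step / (exploration_fraction * total_timesteps), 1.0)
    return initial_eps + progress * (final_eps - initial_eps)


# 419 and 777 are the total_timesteps values from the two log tables above
for step in (419, 777, 10_000, 50_000):
    print(f"step {step:>6}: exploration_rate ~ {epsilon(step):.3f}")
```

Under these assumptions, epsilon reaches its final value after the first 10,000 steps and stays there, so most of the run explores with a 10% chance of a random action.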


## Load the trained agent

In [None]:
# Load the final saved model, dqn_lunar_final.zip

load_model = DQN.load("dqn_lunar_final")
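
The intermediate checkpoints written by `CheckpointCallback` can be loaded the same way. The sketch below lists the files one would expect after the training run, assuming the callback names them `<name_prefix>_<num_timesteps>_steps.zip` and that a single environment is used, so that a checkpoint is written every `save_freq` timesteps. `checkpoint_paths` is a hypothetical helper; check the contents of `models/dqn_lunar` to confirm the actual names.

```python
def checkpoint_paths(model_dir="models/dqn_lunar", name_prefix="dqn_lunar",
                     save_freq=10_000, total_timesteps=100_000):
    """List the checkpoint files expected after training, oldest first."""
    steps = range(save_freq, total_timesteps + 1, save_freq)
    return [f"{model_dir}/{name_prefix}_{step}_steps.zip" for step in steps]


paths = checkpoint_paths()
print(len(paths), "checkpoints expected")
print("first:", paths[0])
print("last: ", paths[-1])
```

If the naming holds for your installed version, `DQN.load(paths[4])` would restore the agent as it was after 50,000 timesteps, which is useful for comparing progress during training.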

In [None]:
# Create evaluation env and create a video using RecordVideo wrapper

eval_env = gym.make("LunarLander-v3", render_mode="rgb_array")
eval_env = gym.wrappers.RecordVideo(eval_env, video_folder=video_final_dir, episode_trigger=lambda ep: ep==0)
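
`RecordVideo` decides which episodes to record by calling `episode_trigger` with the episode index. The sketch below assumes that index is zero-based and compares the two triggers used in this notebook; the function names are hypothetical and no environment is needed to run it.

```python
def first_episode_only(ep):
    # Same logic as `lambda ep: ep == 0` used for the evaluation env
    return ep == 0


def every_50_episodes(ep):
    # Same logic as the `episode_trigger` function used for training
    return ep % 50 == 0


def recorded_episodes(trigger, n_episodes):
    """Episode indices in range(n_episodes) for which the trigger fires."""
    return [ep for ep in range(n_episodes) if trigger(ep)]


print(recorded_episodes(first_episode_only, 200))
print(recorded_episodes(every_50_episodes, 200))
```

With `n_eval_episodes=1`, the evaluation env therefore produces exactly one video, while the training env produces one video per 50 episodes.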

In [None]:
# Evaluate the trained agent
mean_reward, std_reward = evaluate_policy(load_model, eval_env, n_eval_episodes=1, deterministic=True, render=False)

# print out the mean of rewards and std of rewards after training
print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")



mean_reward=90.46 +/- 0.0


In [None]:
# close the environment

eval_env.close()

# ==============================================
# Final Task: If time allows, increase the number of training timesteps so that the mean_reward score exceeds 100. How many timesteps did you need to train the model?