<a href="https://colab.research.google.com/github/abcardoso/ifes_ai/blob/main/IA_GYM_SB_MountainCar.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://github.com/Stable-Baselines-Team/rl-colab-notebooks/blob/sb3/stable_baselines_getting_started.ipynb

# Stable Baselines3 Tutorial - Getting Started

Github repo: https://github.com/araffin/rl-tutorial-jnrr19

Stable-Baselines3: https://github.com/DLR-RM/stable-baselines3

Documentation: https://stable-baselines3.readthedocs.io/en/master/

RL Baselines3 zoo: https://github.com/DLR-RM/rl-baselines3-zoo


[RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo) is a training framework for Reinforcement Learning (RL), using Stable Baselines3.

It provides scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.

## Introduction

Name: Ana Cardoso - IFES 2024.2 - IA - Professor Sergio Nery


In this notebook, you will learn the basics of using the Stable Baselines3 library: how to create an RL model, train it and evaluate it. Because all algorithms share the same interface, we will see how simple it is to switch from one algorithm to another.

We will also apply reinforcement learning to train agents to solve challenges in the following environments:
- Car Racing: https://gymnasium.farama.org/environments/box2d/car_racing/

- Mountain Car: https://gymnasium.farama.org/environments/classic_control/mountain_car/


## Install Dependencies and Stable Baselines3 Using Pip

The full list of dependencies can be found in the [README](https://github.com/DLR-RM/stable-baselines3).


```
pip install stable-baselines3[extra]
```

In [1]:
# for autoformatting
# %load_ext jupyter_black

In [2]:
!apt-get update && apt-get install -y ffmpeg freeglut3-dev xvfb  # For visualization
!pip install "stable-baselines3[extra]"



In [3]:
import stable_baselines3

print(f"{stable_baselines3.__version__=}")

stable_baselines3.__version__='2.3.2'


## Imports

Stable-Baselines3 works on environments that follow the [gym interface](https://stable-baselines3.readthedocs.io/en/master/guide/custom_env.html).
You can find a list of available environments [here](https://gym.openai.com/envs/#classic_control).

It is also recommended to check the [source code](https://github.com/openai/gym) to learn more about the observation and action space of each env, as gym does not have proper documentation.
Not all algorithms work with all action spaces; you can find more in this [recap table](https://stable-baselines3.readthedocs.io/en/master/guide/algos.html).

In [4]:
import gymnasium as gym
import numpy as np

print(f"{gym.__version__=}")

gym.__version__='0.29.1'




The first thing you need to import is the RL model. Check the documentation to know which algorithm you can use on which problem.

The next thing you need to import is the policy class that will be used to create the networks (for the policy/value functions).
This step is optional as you can directly use strings in the constructor:

```PPO('MlpPolicy', env)``` instead of ```PPO(MlpPolicy, env)```

Note that some algorithms like `SAC` have their own `MlpPolicy`, which is why using a string for the policy is the recommended option.

In [5]:
from stable_baselines3 import DQN, PPO, A2C, SAC
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common import policies



## Create the Gym env and instantiate the agent

Here we are using the [Proximal Policy Optimization](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html) algorithm, which is an Actor-Critic method: it uses a value function to improve the policy gradient estimate (by reducing its variance).

It combines ideas from [A2C](https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html) (having multiple workers and using an entropy bonus for exploration) and [TRPO](https://stable-baselines.readthedocs.io/en/master/modules/trpo.html) (it uses a trust region to improve stability and avoid catastrophic drops in performance).
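
In practice, PPO approximates the trust region with a clipped surrogate objective: the probability ratio between the new and the old policy is clipped so that a single update cannot move the policy too far. The NumPy sketch below illustrates that objective only; it is not Stable Baselines3's implementation, and the function name `ppo_clip_loss` and its arguments are hypothetical. The value `clip_range=0.2` matches the commented-out PPO configuration later in this notebook.

```python
import numpy as np


def ppo_clip_loss(log_prob_new, log_prob_old, advantages, clip_range=0.2):
    """Clipped surrogate policy loss (to be minimized), averaged over samples."""
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probabilities
    ratio = np.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # Take the more pessimistic of the two terms, then negate to obtain a loss
    return -np.mean(np.minimum(unclipped, clipped))


# A ratio of 1.5 with a positive advantage is clipped to 1.2
loss = ppo_clip_loss(
    log_prob_new=np.log(np.array([1.5])),
    log_prob_old=np.log(np.array([1.0])),
    advantages=np.array([1.0]),
)
print(f"clipped loss: {loss:.2f}")
```

Once the ratio leaves the interval `[1 - clip_range, 1 + clip_range]` in the direction favoured by the advantage, the objective stops rewarding further change, which is what keeps the update conservative.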

PPO is an on-policy algorithm, which means that the trajectories used to update the networks must be collected using the latest policy.
It is usually less sample efficient than off-policy algorithms like [DQN](https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html), [SAC](https://stable-baselines3.readthedocs.io/en/master/modules/sac.html) or [TD3](https://stable-baselines3.readthedocs.io/en/master/modules/td3.html), but is much faster in terms of wall-clock time.

```python
env = gym.make("CarRacing-v2", render_mode="rgb_array")

model = PPO('CnnPolicy', env, verbose=1)
```

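Create the Gym env (for learning) and instantiate the agent. The cells below train a DQN agent on `MountainCar-v0`; alternative DQN, A2C and PPO configurations are kept as commented-out code.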

In [6]:
#del model

In [7]:
# System packages: build tools and swig (needed to build Box2D), plus ffmpeg and xvfb for rendering videos
!apt-get update
!apt-get install -y build-essential swig
!apt-get install -y ffmpeg freeglut3-dev xvfb  # For visualization
!pip install "stable-baselines3[extra]"
!pip install "gymnasium[box2d]"
!pip install "gymnasium[classic_control]"


In [8]:
#del model

In [9]:
# Create the environment (the VecNormalize wrapping below is left disabled)
env = gym.make("MountainCar-v0", render_mode="rgb_array")
#env = DummyVecEnv([lambda: env])
#env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.)

#model = DQN(
#    "MlpPolicy",
#    env,
#    verbose=2,
#    learning_rate=5e-4,            # Adjusted learning rate
#    buffer_size=100000,            # Larger replay buffer
#    learning_starts=1000,
#    batch_size=64,                 # Increased batch size
#    tau=0.005,
#    gamma=0.98,                    # Slightly lower gamma
#    train_freq=4,
#    gradient_steps=1,
#    target_update_interval=1000,
#    exploration_fraction=0.1,      # Adjusted exploration
#    exploration_final_eps=0.01,    # Lower final epsilon
#    policy_kwargs=dict(net_arch=[256, 256]),  # Larger network
#    seed=2
#)


#model = A2C(
#    "MlpPolicy",
#    env,
#    verbose=1,
#    learning_rate=7e-4,
#    n_steps=5,
#    gamma=0.99,
#    gae_lambda=1.0,
#    vf_coef=0.25,
#    ent_coef=0.01,
#    max_grad_norm=0.5,
#    rms_prop_eps=1e-5,
#    use_rms_prop=True,
#    seed=2,
#)

#model = PPO(
#    "MlpPolicy",
#    env,
#    verbose=1,
#    learning_rate=3e-4,
#    n_steps=2048,
#    batch_size=64,
#    n_epochs=10,
#    gamma=0.99,
#    gae_lambda=0.95,
#    clip_range=0.2,
#    ent_coef=0.0,
#    vf_coef=0.5,
#    max_grad_norm=0.5,
#    seed=2,
#)

model = DQN(
    "MlpPolicy",
    env,
    verbose=2,
    train_freq=20,
    gradient_steps=8,
    gamma=1.0,
    exploration_fraction=0.2,
    exploration_final_eps=0.02,
    target_update_interval=600,
    learning_starts=1000,
    buffer_size=10000,
    batch_size=128,
    learning_rate=4e-3,
    policy_kwargs=dict(net_arch=[256, 256]),
    seed=2,
)
#n_states = 40
#iter_max = 10000
#initial_lr = 1.0
#min_lr = 0.003
#gamma = 1.0
#t_max = 10000
#eps = 0.02


Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
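
The `exploration_fraction=0.2` and `exploration_final_eps=0.02` arguments above control DQN's epsilon-greedy exploration. The plain-Python sketch below shows the resulting schedule, assuming a linear decay from an initial epsilon of 1.0 (the default `exploration_initial_eps` in Stable Baselines3) and the 148,000 timesteps passed to `model.learn` later in this notebook. The function `linear_epsilon` is a hypothetical helper, not part of the library.

```python
def linear_epsilon(step, total_timesteps=148_000, exploration_fraction=0.2,
                   initial_eps=1.0, final_eps=0.02):
    """Epsilon after `step` timesteps under a linear decay schedule."""
    # Epsilon decays over the first `exploration_fraction` of training only
    decay_steps = exploration_fraction * total_timesteps
    progress = min(step / decay_steps, 1.0)
    return initial_eps + progress * (final_eps - initial_eps)


for step in (0, 2_000, 4_000, 29_600, 148_000):
    print(f"step={step:>7}: epsilon={linear_epsilon(step):.3f}")
```

Under these assumptions the schedule gives 0.934 after 2,000 steps and 0.868 after 4,000 steps, which agrees with the `exploration_rate` values reported in the training log of this notebook. After 29,600 steps epsilon stays at 0.02 for the rest of training.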


In [10]:
model.get_parameters()



{'policy': OrderedDict([('q_net.q_net.0.weight',
               tensor([[ 0.1622, -0.1683],
                       [ 0.1939, -0.0361],
                       [ 0.3021,  0.1683],
                       [-0.0813, -0.5717],
                       [ 0.1614, -0.6260],
                       [ 0.0929,  0.0470],
                       [-0.1555,  0.5782],
                       [ 0.0472,  0.2932],
                       [ 0.2992, -0.4171],
                       [-0.2718,  0.6800],
                       [-0.6926, -0.0480],
                       [-0.0560,  0.5016],
                       [-0.0672,  0.1862],
                       [-0.0339, -0.3959],
                       [-0.4008, -0.3435],
                       [-0.6423, -0.4589],
                       [ 0.1664,  0.4654],
                       [ 0.0348, -0.3241],
                       [ 0.3108, -0.2714],
                       [-0.1566, -0.3876],
                       [-0.2220, -0.6552],
                       [ 0.3017,  0.2749],
     

Let's evaluate the untrained agent. Its network weights are still random, so it should perform about as badly as a random agent.

In [11]:
# Use a separate environment for evaluation
eval_env = gym.make("MountainCar-v0", render_mode="rgb_array")

# Random Agent, before training
#eval_env = DummyVecEnv([lambda: eval_env])
#eval_env = VecNormalize(eval_env, training=False, norm_obs=True, norm_reward=True, clip_obs=10.)

mean_reward, std_reward = evaluate_policy(model, eval_env, deterministic=True, n_eval_episodes=20)
print(f"Untrained mean_reward: {mean_reward:.2f} +/- {std_reward:.2f}")




Untrained mean_reward: -200.00 +/- 0.00


## Train the agent and evaluate it

In [12]:
# Train the agent until convergence; the learned Q-value function can then be analysed
model.learn(total_timesteps=int(1.48e5), log_interval=10, progress_bar=True)

#model.learn(total_timesteps=500000, log_interval=10)


----------------------------------
| rollout/            |          |
|    ep_len_mean      | 200      |
|    ep_rew_mean      | -200     |
|    exploration_rate | 0.934    |
| time/               |          |
|    episodes         | 10       |
|    fps              | 785      |
|    time_elapsed     | 2        |
|    total_timesteps  | 2000     |
| train/              |          |
|    learning_rate    | 0.004    |
|    loss             | 0.000136 |
|    n_updates        | 392      |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 200      |
|    ep_rew_mean      | -200     |
|    exploration_rate | 0.868    |
| time/               |          |
|    episodes         | 20       |
|    fps              | 541      |
|    time_elapsed     | 7        |
|    total_timesteps  | 4000     |
| train/              |          |
|    learning_rate    | 0.004    |
|    loss             | 2.4e-06  |
|    n_updates      

<stable_baselines3.dqn.dqn.DQN at 0x79e2546ca7d0>

In [13]:
# Evaluate the trained agent
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=200)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")


mean_reward:-100.94 +/- 8.83


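The training went well: the mean reward rose from -200.00 to -100.94 +/- 8.83. Since MountainCar-v0 gives a reward of -1 per step, the car now reaches the flag in about 101 steps on average instead of timing out after 200.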

### Prepare video recording

In [14]:
# Set up fake display; otherwise rendering will fail
import os
os.system("Xvfb :1 -screen 0 1024x768x24 &")
os.environ['DISPLAY'] = ':1'

In [15]:
import base64
from pathlib import Path

from IPython import display as ipythondisplay


def show_videos(video_path="", prefix=""):
    """
    Taken from https://github.com/eleurent/highway-env

    :param video_path: (str) Path to the folder containing videos
    :param prefix: (str) Filter the videos, showing only the ones starting with this prefix
    """
    html = []
    for mp4 in Path(video_path).glob("{}*.mp4".format(prefix)):
        video_b64 = base64.b64encode(mp4.read_bytes())
        html.append(
            """<video alt="{}" autoplay
                    loop controls style="height: 400px;">
                    <source src="data:video/mp4;base64,{}" type="video/mp4" />
                </video>""".format(
                mp4, video_b64.decode("ascii")
            )
        )
    ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))

We will record a video using the [VecVideoRecorder](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecvideorecorder) wrapper; you will learn about those wrappers in the next notebook.

In [16]:
from stable_baselines3.common.vec_env import VecVideoRecorder, DummyVecEnv


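def record_video1(env_id, model, video_length=1000, prefix="", video_folder="videos/"):
    """
    Record a video of the agent acting in the environment.

    :param env_id: (str) Id of the Gym environment to record
    :param model: (RL model) Agent used to select the actions
    :param video_length: (int) Number of steps to record
    :param prefix: (str) Prefix for the video file name
    :param video_folder: (str) Folder where the video is saved
    """
    # Build the environment from env_id instead of a hard-coded id
    eval_env = DummyVecEnv([lambda: gym.make(env_id, render_mode="rgb_array")])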
    # Start the video at step=0 and record <video_length> steps
    eval_env = VecVideoRecorder(
        eval_env,
        video_folder=video_folder,
        record_video_trigger=lambda step: step == 0,
        video_length=video_length,
        name_prefix=prefix,
    )

    obs = eval_env.reset()
    for _ in range(video_length):
        action, _ = model.predict(obs)
        obs, _, _, _ = eval_env.step(action)

    # Close the video recorder
    eval_env.close()

### Visualize trained agent



In [17]:
record_video1("MountainCar-v0", model, video_length=1000, prefix="mountaincar")

Saving video to /content/videos/mountaincar-step-0-to-step-1000.mp4
Moviepy - Building video /content/videos/mountaincar-step-0-to-step-1000.mp4.
Moviepy - Writing video /content/videos/mountaincar-step-0-to-step-1000.mp4





Moviepy - Done !
Moviepy - video ready /content/videos/mountaincar-step-0-to-step-1000.mp4


In [18]:
show_videos("videos", prefix="mountaincar")



## Conclusion

In this notebook we have seen:
- how to define, train and evaluate an RL model using Stable Baselines3, and how to record a video of the trained agent ;)
